Abstract:

The strength of association between a pair of data vectors is represented by a nonnegative
real number, called the matching weight. For dimensionality reduction, we consider a
linear transformation of data vectors, and define a matching error as the weighted sum
of squared distances between transformed vectors with respect to the matching weights.
Given data vectors and matching weights, the optimal linear transformation minimizing
the matching error is obtained by the spectral graph embedding of Yan et al. (2007). This
method is a generalization of canonical correlation analysis, and will be called
matching correlation analysis (MCA). In this paper, we consider a novel sampling scheme
where the observed matching weights are randomly sampled from underlying true matching
weights with small probability, whereas the data vectors are treated as constants.
We then investigate a cross-validation by resampling the matching weights. Our asymptotic
theory shows that the cross-validation, if rescaled properly, computes an unbiased
estimate of the matching error with respect to the true matching weights. Existing ideas
of cross-validation for resampling data vectors, instead of resampling matching weights,
are not applicable here. MCA can be used for data vectors from multiple domains with
different dimensions via an embarrassingly simple idea of coding the data vectors. This
method will be called cross-domain matching correlation analysis (CDMCA), and an
interesting connection to the classical associative memory model of neural networks is
also discussed.

1. Introduction

We have $N$ data vectors of $P$ dimensions. Let $x_1, \ldots, x_N \in \mathbb{R}^P$ be the data vectors, and
$X = (x_1, \ldots, x_N)^T \in \mathbb{R}^{N \times P}$ be the data matrix. We also have matching weights between the data
vectors. Let $w_{ij} = w_{ji} \ge 0$, $i, j = 1, \ldots, N$, be the matching weights, and
$W = (w_{ij}) \in \mathbb{R}^{N \times N}$ be the matching weight matrix. The matching weight $w_{ij}$ represents the
strength of association between $x_i$ and $x_j$. For dimensionality reduction, we will consider a linear
transformation from $\mathbb{R}^P$ to $\mathbb{R}^K$ for some $K \le P$ as

$$y_i = A^T x_i, \quad i = 1, \ldots, N,$$

or $Y = XA$, where $A \in \mathbb{R}^{P \times K}$ is the linear transformation matrix, $y_1, \ldots, y_N \in \mathbb{R}^K$ are the
transformed vectors, and $Y = (y_1, \ldots, y_N)^T \in \mathbb{R}^{N \times K}$ is the transformed matrix. Observing $X$
and $W$, we would like to find $A$ that minimizes the matching error

$$\phi = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \|y_i - y_j\|^2$$

under some constraints. We expect that the distance between $y_i$ and $y_j$ will be small when $w_{ij}$
is large, so that the locations of transformed vectors represent both the locations of the
data vectors and the associations between data vectors. The optimization problem for finding
$A$ is solved by the spectral graph embedding for dimensionality reduction of Yan et al. (2007).
Similarly to principal component analysis (PCA), the optimal solution is obtained as the
eigenvectors of the largest $K$ eigenvalues of some matrix computed from $X$ and $W$. In Section 3,
this method will be formulated by specifying the constraints on the transformed vectors and
also regularization terms for numerical stability. We will call the method matching correlation
analysis (MCA), since it is a generalization of the classical canonical correlation analysis
(CCA) of Hotelling (1936). The matching error will be represented by matching correlations of
transformed vectors, which correspond to the canonical correlations of CCA.

MCA will be called cross-domain matching correlation analysis (CDMCA) when we have
data vectors from multiple domains with different sample sizes and different dimensions. Let
$D$ be the number of domains, and $d = 1, \ldots, D$ denote each domain. For example, domain
$d = 1$ may be for image feature vectors, and domain $d = 2$ may be for word vectors computed
by word2vec (Mikolov et al., 2013) from texts, where the matching weights between the two
domains may represent tags of images in a large dataset, such as Flickr. From domain $d$,
we get data vectors $x_i^{(d)} \in \mathbb{R}^{p_d}$, $i = 1, \ldots, n_d$, where $n_d$ is the number of data vectors, and
$p_d$ is the dimension of the data vector. Typically, $p_d$ is in the hundreds, and $n_d$ is in the thousands to
millions. We would like to retrieve relevant words from an image query, and conversely retrieve
images from a word query. Given matching weights across/within domains, we attempt to find
linear transformations of data vectors from multiple domains to a “common space” of lower
dimensionality so that the distances between transformed vectors represent the matching
weights well. This problem is solved by an embarrassingly simple idea of coding the data vectors,
which is similar to that of Daumé III (2009). Each data vector from domain $d$ is represented by
an augmented data vector $\tilde{x}$ of dimension $P = \sum_{d=1}^{D} p_d$, where only $p_d$ dimensions are for the
original data vector and the remaining $P - p_d$ dimensions are padded with zeros. In the case of $D = 2$
with $p_1 = 2$, $p_2 = 3$, say, a data vector $(1, 2)^T$ of domain 1 is represented by $(1, 2, 0, 0, 0)^T$,
and $(3, 4, 5)^T$ of domain 2 is represented by $(0, 0, 3, 4, 5)^T$. The number of total augmented data
vectors is $N = \sum_{d=1}^{D} n_d$. Note that the above-mentioned “embarrassingly simple coding” is not
actually implemented by padding zeros in computer software; only the nonzero elements are
stored in memory, and CDMCA is in fact implemented very efficiently for sparse $W$. CDMCA is
illustrated in a numerical example of Section 2. CDMCA is further explained in Appendix A.1,
and an interesting connection to the classical associative memory model of neural networks
(Kohonen, 1972; Nakano, 1972) is also discussed in Appendix A.2.
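
The coding of data vectors described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `augment` is hypothetical, and the sketch pads zeros densely for readability, whereas the text notes that actual software stores only the nonzero elements.

```python
import numpy as np

def augment(vectors_by_domain):
    """Zero-pad each domain's vectors into a common P-dimensional space.

    vectors_by_domain: list of arrays, the d-th of shape (n_d, p_d).
    Returns an array of shape (N, P) with N = sum(n_d), P = sum(p_d).
    (Hypothetical helper; dense padding is for illustration only.)
    """
    dims = [v.shape[1] for v in vectors_by_domain]
    P = sum(dims)
    offsets = np.concatenate(([0], np.cumsum(dims)[:-1]))
    blocks = []
    for v, off, p in zip(vectors_by_domain, offsets, dims):
        block = np.zeros((v.shape[0], P))
        block[:, off:off + p] = v  # only the p_d coordinates of domain d are nonzero
        blocks.append(block)
    return np.vstack(blocks)

# The D = 2 example from the text: p_1 = 2, p_2 = 3.
X = augment([np.array([[1.0, 2.0]]), np.array([[3.0, 4.0, 5.0]])])
```

With this coding, a single transformation matrix of shape (P, K) stacks the per-domain transformation matrices, which is why the single-domain method can be applied to the augmented vectors.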

CDMCA is solved by applying the single-domain version of MCA described in Section 3 to
the augmented data vectors, and thus we only discuss the single-domain version in this paper.
This formulation of CDMCA includes a wide class of problems of multivariate analysis, and
similar approaches have recently become popular in pattern recognition and vision (Correa et al.,
2010; Yuan et al., 2011; Kan et al., 2012; Shi et al., 2013; Wang et al., 2013; Gong et al., 2014;
Yuan and Sun, 2014). CDMCA is equivalent to the method of Nori, Bollegala and Kashima
(2012) for multinomial relation prediction if the matching weights are defined by cross-products
of the binary matrices representing relations between objects and instances. CDMCA is also
found in Huang et al. (2013) for the case of $D = 2$. CDMCA reduces to the multi-set canonical
correlation analysis (MCCA) (Kettenring, 1971; Takane, Hwang and Abdi, 2008; Tenenhaus
and Tenenhaus, 2011) when $n_1 = \cdots = n_D$ with cross-domain matching weight matrices being
proportional to the identity matrix. It becomes the classical CCA by further letting $D = 2$, or
it becomes PCA by letting $p_1 = p_2 = \cdots = p_D = 1$.

In this paper, we discuss a cross-validation method for computing the matching error of
MCA. In Section 4, we will define two types of matching errors, i.e., fitting error and true
error, and introduce the cross-validation (cv) error for estimating the true error. In order to discuss
distributional properties of MCA, we consider the following sampling scheme. First, the data
vectors are treated as constants. Similarly to the explanatory variables in regression analysis,
we perform conditional inference given the data matrix $X$, although we do not exclude the assumption
that the $x_i$'s are sampled from some probability distribution. Second, the matching weights $w_{ij}$ are
randomly sampled from underlying true matching weights $\bar{w}_{ij}$ with small probability $\epsilon > 0$.
The value of $\epsilon$ is unknown and it should not be used in our inference. Let $z_{ij} = z_{ji} \in \{0, 1\}$,
$i, j = 1, \ldots, N$, be samples from Bernoulli trials with success probability $\epsilon$, where the number of
independent elements is $N(N + 1)/2$ due to the symmetry. Then the observed matching weights
are defined as

$$w_{ij} = z_{ij} \bar{w}_{ij}, \quad P(z_{ij} = 1) = \epsilon. \qquad (1)$$

The true matching weight matrix $\bar{W} = (\bar{w}_{ij}) \in \mathbb{R}^{N \times N}$ is treated as an unknown constant
matrix with elements $\bar{w}_{ij} = \bar{w}_{ji} \ge 0$. This setting will be appropriate for large-scale data,
such as those obtained automatically from the web, where only a small portion $W$ of the true
association $\bar{W}$ may be obtained as our knowledge.
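
A minimal sketch of the link sampling scheme (1) is given below, assuming a dense symmetric array for the true weights. The function name `link_sample` and the dense representation are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def link_sample(W_true, eps, rng):
    """Observed weights w_ij = z_ij * wbar_ij with symmetric Bernoulli(eps) z_ij.

    Hypothetical helper: W_true is a dense symmetric (N, N) array.
    """
    N = W_true.shape[0]
    # Draw the N(N+1)/2 independent indicators on the upper triangle
    # (diagonal included), then mirror them so that z_ij = z_ji.
    Z = np.triu(rng.random((N, N)) < eps)
    Z = Z | Z.T
    return np.where(Z, W_true, 0.0)

rng = np.random.default_rng(0)
B = rng.random((8, 8))
W_true = B + B.T                      # a symmetric nonnegative "true" matrix
W_obs = link_sample(W_true, 0.2, rng)
```

Each observed entry is either zero or equal to the corresponding true weight, and the expectation of the observed matrix is $\epsilon$ times the true matrix.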

In Section 4.2, we will consider a resampling scheme corresponding to (1). For the
cross-validation, we resample $W^*$ from $W$ with small probability $\kappa > 0$, whereas $X$ is left untouched.
Our sampling/resampling scheme is unusual in the sense that the source of randomness is
$W$ instead of $X$, and existing results of cross-validation for resampling from $X$ such as Stone
(1977) and Golub, Heath and Wahba (1979) are not applicable here. The traditional method
of resampling data vectors is discussed in Section 4.3.

The true error is defined with respect to the unknown $\bar{W}$, and the fitting error is defined with
respect to the observed $W$. We would like to look at the true error for finding appropriate values
of the regularization terms (regularization parameters are generally denoted as $\gamma$ throughout)
and the dimension $K$ of the transformed vectors. However, the true error is unavailable, and
the fitting error is biased for estimating the true error. The main thrust of this paper is to show
asymptotically that the cv error, if rescaled properly, is an unbiased estimator of the true error.
The value of $\epsilon$ is unnecessary for computing the cv error, but $W$ should be a sparse matrix.
The unbiasedness of the cv error is illustrated by a simulation study in Section 5, and it is
shown theoretically by the asymptotic theory of $N \to \infty$ in Section 6.

2. Illustrative example 

Let us see an example of CDMCA applied to the MNIST database of handwritten digits (see
Appendix B.1 for the experimental details). The number of domains is $D = 3$ with the number
of vectors $n_1 = 60{,}000$, $n_2 = 10$, $n_3 = 3$, and dimensions $p_1 = 2784$, $p_2 = 100$, $p_3 = 50$. The
handwritten digit images are stored in domain $d = 1$, while domain $d = 2$ is for the digit labels
“zero”, “one”, ... , “nine”, and domain $d = 3$ is for attribute labels “even”, “odd”, “prime”.
This CDMCA is also interpreted as MCA with $N = 60{,}013$ and $P = 2934$.

The elements of $\bar{W}$ are simply the indicator variables (called dummy variables in statistics)
of image labels. Instead of working on $\bar{W}$, here we made $W$ by sampling 20% of the elements
from $\bar{W}$ for illustrating how CDMCA works. The optimal $A$ is computed from $W$ using the
method described in Section 3.3 with regularization parameter $\gamma_M = 0.1$. The data matrix
$X$ is centered, and the transformed matrix $Y$ is rescaled. The first and second elements of
$y_i$, namely, $(y_{i1}, y_{i2})$, $i = 1, \ldots, N$, are shown in Fig. 1. For the computation of $A$, we do
not have to specify the value of $K$ in advance. Similar to PCA, we first solve for the optimal
$A = (a^1, \ldots, a^P) \in \mathbb{R}^{P \times P}$ for the case of $K = P$, then take the first $K$ columns to get the
optimal $A = (a^1, \ldots, a^K) \in \mathbb{R}^{P \times K}$ for any $K \le P$. We observe that images and labels are
placed in the common space so that they represent both $X$ and $W$. Given a digit image, we
may find the nearest digit label or attribute label to tell what the image represents.

The optimal $A$ of $K = 9$ is then computed for several $\gamma_M$ values. For each $A$, the 10000
images of the test dataset are projected to the common space and the digit labels and attribute
labels are predicted. We observe in Fig. 2(a) that the classification errors become small when the
regularization parameter is around $\gamma_M = 0.1$. Since $x_i$ does not contribute to $A$ if $w_{ij} = 0$ for all $j$,
these error rates are computed using only 20% of $X$; they improve to 0.0359 ($d = 2$) and 0.0218
($d = 3$) if $\bar{W}$ is used for the computation of $A$ with $K = 11$ and $\gamma_M = 0$.

It is important to choose an appropriate value of $\gamma_M$ for minimizing the classification error.
We observe in Fig. 2(b) that the curve of the true matching error of the test dataset is similar
to the curves of the classification errors. However, the fitting error wrongly suggests that a
smaller $\gamma_M$ value would be better. Here, the fitting error is the matching error computed from
the training dataset, and it underestimates the true matching error. On the other hand, the
matching error computed by the cross-validation method of Section 4.2 correctly suggests that
$\gamma_M = 0.1$ is a good choice.

3. Matching correlation analysis 

3.1. Matching error and matching correlation 

Let $M \in \mathbb{R}^{N \times N}$ be the diagonal matrix of row (column) sums of $W$:

$$M = \mathrm{diag}(m_1, \ldots, m_N), \quad m_i = \sum_{j=1}^{N} w_{ij}.$$

This is also expressed as $M = \mathrm{diag}(W 1_N) \in \mathbb{R}^{N \times N}$ using $1_N \in \mathbb{R}^N$, the vector with all elements
one. $M - W$ is sometimes called the graph Laplacian. This notation will be applied to other
weight matrices, say, $\bar{M}$ for $\bar{W}$. Key notations are shown in Table 1.

Column vectors of matrices will be denoted by superscripts. For example, the $k$-th component
of $Y = (y_{ik};\ i = 1, \ldots, N,\ k = 1, \ldots, K)$ is $y^k = (y_{1k}, \ldots, y_{Nk})^T \in \mathbb{R}^N$ for $k = 1, \ldots, K$, and
we write $Y = (y^1, \ldots, y^K)$. Similarly, $X = (x^1, \ldots, x^P)$ with $x^k \in \mathbb{R}^N$, and $A = (a^1, \ldots, a^K)$
with $a^k \in \mathbb{R}^P$. The linear transformation is now written as $y^k = X a^k$, $k = 1, \ldots, K$.

The matching error of the $k$-th component $y^k$ is defined by

$$\phi_k = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (y_{ik} - y_{jk})^2,$$

and the matching error of all the components is $\phi = \sum_{k=1}^{K} \phi_k$. By noticing $W = W^T$, the
matching error is rewritten as

$$\phi_k = \frac{1}{2} \sum_{i=1}^{N} m_i y_{ik}^2 + \frac{1}{2} \sum_{j=1}^{N} m_j y_{jk}^2 - \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} y_{ik} y_{jk} = y^{kT} (M - W) y^k.$$

Let us specify constraints on $Y$ as

$$y^{kT} M y^k = \sum_{i=1}^{N} m_i y_{ik}^2 = 1, \quad k = 1, \ldots, K. \qquad (2)$$

In other words, the weighted variance of $y_{1k}, \ldots, y_{Nk}$ with respect to the weights $m_1, \ldots, m_N$
is fixed as a constant. Note that we say “variance” or “correlation” although variables are not
centered explicitly throughout. The matching error is now written as

$$\phi_k = 1 - y^{kT} W y^k.$$

We call $y^{kT} W y^k$ the matching (auto) correlation of $y^k$.
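
The identities above are easy to check numerically. The following sketch uses a small random symmetric weight matrix (an arbitrary choice made here for illustration) and compares the double-sum definition of the matching error with the quadratic form in the graph Laplacian and with one minus the matching correlation under constraint (2).

```python
import numpy as np

def matching_error(W, y):
    """0.5 * sum_ij w_ij (y_i - y_j)^2 for a single component y."""
    diff = y[:, None] - y[None, :]
    return 0.5 * np.sum(W * diff**2)

rng = np.random.default_rng(0)
B = rng.random((6, 6))
W = B + B.T                      # symmetric, nonnegative weights (illustrative)
m = W.sum(axis=1)                # row sums, the diagonal of M

y = rng.standard_normal(6)
y /= np.sqrt(y @ (m * y))        # enforce constraint (2): y^T M y = 1

phi = matching_error(W, y)       # definition
quad = y @ (np.diag(m) - W) @ y  # y^T (M - W) y
corr = y @ W @ y                 # matching (auto) correlation
```

Under the constraint, `phi`, `quad` and `1 - corr` agree up to floating-point rounding.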

More generally, the matching error between the $k$-th component $y^k$ and the $l$-th component $y^l$
for $k, l = 1, \ldots, K$, is defined by

$$\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (y_{ik} - y_{jl})^2 = 1 - y^{kT} W y^l,$$

and the matching (cross) correlation between $y^k$ and $y^l$ is defined by $y^{kT} W y^l$. This is
analogous to the weighted correlation $y^{kT} M y^l$ with respect to the weights $m_1, \ldots, m_N$, but a
different measure of association between $y^k$ and $y^l$. It is easily verified that $|y^{kT} W y^l| \le 1$ as
well as $|y^{kT} M y^l| \le 1$. The matching errors reduce to zero when the corresponding matching
correlations approach 1.

A matching error not smaller than one, i.e., $\phi_k \ge 1$, may indicate that the component $y^k$ is not
appropriate for representing $W$. In other words, the matching correlation should be positive:

$$y^{kT} W y^k > 0.$$

To justify this argument, let us consider the case where the elements $y_{ik}$, $i = 1, \ldots, N$, are
independent random variables with mean zero. Then $E(y^{kT} W y^k) = \sum_{i=1}^{N} w_{ii} V(y_{ik}) = 0$ if
$w_{ii} = 0$. Therefore random components, if centered properly, give the matching error $\phi_k \approx 1$.

3.2. The spectral graph embedding for dimensionality reduction 

We would like to find the linear transformation matrix $\hat{A}$ that minimizes $\phi = \sum_{k=1}^{K} \phi_k$. Here
symbols are denoted with a hat like $\hat{Y} = X \hat{A}$ to make a distinction from those defined in
Section 3.3. Define $P \times P$ symmetric matrices

$$\hat{G} = X^T M X, \quad \hat{H} = X^T W X.$$

Then, we consider the optimization problem:

Maximize $\mathrm{tr}(\hat{A}^T \hat{H} \hat{A})$ with respect to $\hat{A} \in \mathbb{R}^{P \times K}$ (3)

subject to $\hat{A}^T \hat{G} \hat{A} = I_K$. (4)

The objective function $\mathrm{tr}(\hat{A}^T \hat{H} \hat{A}) = \sum_{k=1}^{K} \hat{y}^{kT} W \hat{y}^k$ is the sum of matching correlations of $\hat{y}^k$,
$k = 1, \ldots, K$, and thus (3) is equivalent to the minimization of $\phi$ as we wished. The constraints
in (4) are $\hat{y}^{kT} M \hat{y}^l = \delta_{kl}$, $k, l = 1, \ldots, K$. In addition to (2), we assumed that $\hat{y}^k$, $k = 1, \ldots, K$,
are uncorrelated with each other to prevent $\hat{a}^1, \ldots, \hat{a}^K$ from degenerating to the same vector.

The optimization problem mentioned above is the same formulation as the spectral graph 
embedding for dimensionality reduction of Yan et al. (2007). A difference is that W is specified 
by external knowledge in our setting, while W is often specified from X in the graph embedding 
literature. Similar optimization problems are found in the spectral graph theory (Chung, 1997), 
the normalized graph Laplacian (Von Luxburg, 2007), or the spectral embedding (Belkin and 
Niyogi, 2003) for the case of $X = I_N$.

3.3. Regularization and rescaling 

We introduce regularization terms $\Delta G$ and $\Delta H$ for numerical stability. They are $P \times P$
symmetric matrices, and are added to $\hat{G}$ and $\hat{H}$. We will replace $\hat{G}$ and $\hat{H}$ in the optimization problem
with

$$G = \hat{G} + \Delta G, \quad H = \hat{H} + \Delta H.$$

The same regularization terms are considered in Takane, Hwang and Abdi (2008) for MCCA. We
may write $\Delta G = \gamma_M L_M$ and $\Delta H = \gamma_W L_W$ with prespecified matrices, say, $L_M = L_W = I_P$,
and attempt to choose appropriate values of the regularization parameters $\gamma_M, \gamma_W \in \mathbb{R}$.
We then work on the optimization problem:

Maximize $\mathrm{tr}(A^T H A)$ with respect to $A \in \mathbb{R}^{P \times K}$ (5)

subject to $A^T G A = I_K$. (6)

For the solution of the optimization problem, let $G^{1/2} \in \mathbb{R}^{P \times P}$ be one of the matrices
satisfying $(G^{1/2})^T G^{1/2} = G$. The inverse matrix is denoted by $G^{-1/2} = (G^{1/2})^{-1}$. These
are easily computed by, say, the Cholesky decomposition or the spectral decomposition of a symmetric
matrix. The eigenvalues of $(G^{-1/2})^T H G^{-1/2}$ are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_P$, and the corresponding
normalized eigenvectors are $u_1, u_2, \ldots, u_P \in \mathbb{R}^P$. The solution of our optimization problem is

$$A = G^{-1/2} (u_1, \ldots, u_K). \qquad (7)$$


The solution (7) can also be characterized by (6) and

$$A^T H A = \Lambda, \qquad (8)$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$. Obviously, we do not have to solve the optimization problem
several times when changing the value of $K$. We may compute $a^k = G^{-1/2} u_k$, $k = 1, \ldots, P$,
and take the first $K$ vectors to get $A = (a^1, \ldots, a^K)$ for any $K \le P$. This is the same property
of PCA mentioned in Section 14.5 of Hastie, Tibshirani and Friedman (2009).
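
The solution (7) can be sketched with NumPy as follows. This is a minimal dense illustration, not the paper's implementation: the function name `mca_solve` is hypothetical, the regularization matrices are taken as identity matrices, and $G$ is assumed to be positive definite so that the Cholesky factor exists.

```python
import numpy as np

def mca_solve(X, W, gamma_M=0.0, gamma_W=0.0):
    """Eigenvalues (descending) and all P columns of A solving (5)-(6).

    Hypothetical helper; assumes L_M = L_W = I_P and G positive definite.
    """
    P = X.shape[1]
    m = W.sum(axis=1)
    G = X.T @ (m[:, None] * X) + gamma_M * np.eye(P)   # X^T M X + Delta G
    H = X.T @ W @ X + gamma_W * np.eye(P)              # X^T W X + Delta H
    R = np.linalg.cholesky(G).T       # upper factor: R^T R = G, plays the role of G^{1/2}
    R_inv = np.linalg.inv(R)          # G^{-1/2}
    S = R_inv.T @ H @ R_inv
    lam, U = np.linalg.eigh((S + S.T) / 2)   # symmetrize against rounding
    order = np.argsort(lam)[::-1]            # largest eigenvalue first
    return lam[order], R_inv @ U[:, order]
```

Taking `A[:, :K]` from the returned matrix gives the solution for any `K`, so the eigenproblem is solved only once.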

Suppose $\Delta G = \Delta H = 0$. Then the problem becomes that of Section 3.2, and (2) holds. From
the diagonal part of (8), we have $y^{kT} W y^k = \lambda_k$, $k = 1, \ldots, K$, meaning that the eigenvalues
are the matching correlations of the $y^k$'s. From the off-diagonal parts of (6) and (8), we also have
$y^{kT} M y^l = y^{kT} W y^l = 0$ for $k \ne l$, meaning that the weighted correlations and the matching
correlations between the components are all zero. These two types of correlations defined in
Section 3.1 explain the structure of the solution of our optimization problem. Since the matching
correlations should be positive for representing $W$, we will confine the components to those
with $\lambda_k > 0$. Let $K_+$ be the number of positive eigenvalues. Then we will choose $K$ not larger
than $K_+$.

In general, $\Delta G \ne 0$, and (2) does not hold. We thus rescale each component as $y^k = b_k X a^k$
with factor $b_k > 0$, $k = 1, \ldots, K$. We may set $b_k = (a^{kT} X^T M X a^k)^{-1/2}$ so that (2) holds. In
matrix notation, $Y = XAB$ with $B = \mathrm{diag}(b_1, \ldots, b_K)$. Another choice of rescaling factor
is to set $b_k = (a^{kT} X^T X a^k)^{-1/2}$, so that

$$y^{kT} y^k = \sum_{i=1}^{N} y_{ik}^2 = 1, \quad k = 1, \ldots, K \qquad (9)$$

holds. In other words, the unweighted variance of $y_{1k}, \ldots, y_{Nk}$ is fixed as a constant. Both
rescaling factors defined by (2) and (9) are considered in the simulation study of Section 5, but
only (2) is considered for the asymptotic theory of Section 6.

In the numerical computation of Section 2 and Section 5, the data matrix $X$ is centered
as $\sum_{i=1}^{N} m_i x_i = 0$ for (2) and $\sum_{i=1}^{N} x_i = 0$ for (9). Thus the transformed vectors $y_i$ are also
centered in the same way. The rescaling factors $b_k$, $k = 1, \ldots, K$, are actually computed by
multiplying by $(\sum_{i=1}^{N} m_i)^{1/2}$ for the weighted variance and $N^{1/2}$ for the unweighted variance. In
other words, (2) is replaced by $\sum_{i=1}^{N} m_i y_{ik}^2 = \sum_{i=1}^{N} m_i$, and (9) is replaced by $\sum_{i=1}^{N} y_{ik}^2 = N$.
This makes the magnitude of the components $y_{ik} = O(1)$, so that the interpretation becomes
easier. The matching error is then computed as $\phi_k / (\sum_{i=1}^{N} m_i)$.
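
The two rescaling choices can be written as one small function. The sketch below is an assumption-laden illustration: the name `rescale` and its `m=None` switch are hypothetical, and it normalizes to the constants in (2) and (9) rather than to the totals used in the numerical sections.

```python
import numpy as np

def rescale(X, A, m=None):
    """Return Y = X A B with B = diag(b_1, ..., b_K).

    m given : weighted rescaling, so that sum_i m_i y_ik^2 = 1 as in (2).
    m None  : unweighted rescaling, so that sum_i y_ik^2 = 1 as in (9).
    (Hypothetical helper for illustration.)
    """
    Y = X @ A
    if m is None:
        scale = np.sum(Y**2, axis=0)
    else:
        scale = np.sum(m[:, None] * Y**2, axis=0)
    return Y / np.sqrt(scale)       # column k is multiplied by b_k = scale_k^{-1/2}
```

Multiplying the result by the square root of the total weight, or of `N`, would give the versions with components of order one described in the last paragraph.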



H. SHIMODAIRA/cross-validation of matching correlation analysis 




4. Three types of matching errors

4.1. Fitting error and true error

$A$ and $Y$ are computed from $W$ by the method of Section 3.3. The matching error of the $k$-th
component $y^k$ is defined with respect to an arbitrary weight matrix $\tilde{W}$ as

$$\phi_k(W, \tilde{W}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \tilde{w}_{ij} (y_{ik} - y_{jk})^2.$$

We will omit $X$ from the notation, since it is fixed throughout. We also omit $\Delta G$ and $\Delta H$
from the notation above, although the matching error actually depends on them. We define the
fitting error as

$$\phi_k^{\mathrm{fit}}(W) = \phi_k(W, W)$$

by letting $\tilde{W} = W$. This is the $\phi_k$ in Section 3.1. On the other hand, we define the true error
as

$$\phi_k^{\mathrm{true}}(W, \bar{W}) = \phi_k(W, \epsilon \bar{W})$$

by letting $\tilde{W} = \epsilon \bar{W}$. Since $\phi_k(W, \tilde{W})$ is proportional to $\tilde{W}$, we have used $\epsilon \bar{W}$ instead of $\bar{W}$ so
that $\phi_k^{\mathrm{fit}}$ and $\phi_k^{\mathrm{true}}$ are directly comparable with each other. Let $E(\cdot)$ denote the expectation with
respect to (1). Then $E(W) = \epsilon \bar{W}$ because $E(w_{ij}) = E(z_{ij}) \bar{w}_{ij} = \epsilon \bar{w}_{ij}$. Therefore, $\tilde{W} = \epsilon \bar{W}$ is
comparable with $\tilde{W} = W$.

4.2. Resampling matching weights for cross-validation error

The bias of the fitting error for estimating the true error is $O(N^{-1} P)$ as shown in Section 6.4.
We adjust this bias by cross-validation as follows. The observed weight $W$ is randomly split
into $W - W^*$ for learning and $W^*$ for testing, and the matching error $\phi_k(W - W^*, W^*)$ is
computed. By repeating this several times and averaging the matching error, we
get a cross-validation (cv) error. A more formal definition of the cv error is given below.

The matching weights $w_{ij}^*$ are randomly resampled from the observed matching weights $w_{ij}$
with small probability $\kappa > 0$. Let $z_{ij}^* = z_{ji}^* \in \{0, 1\}$, $i, j = 1, \ldots, N$, be samples from Bernoulli
trials with success probability $\kappa$, where the number of independent elements is $N(N + 1)/2$ due
to the symmetry. Then the resampled matching weights are defined as

$$w_{ij}^* = z_{ij}^* w_{ij}, \quad P(z_{ij}^* = 1 \mid W) = \kappa. \qquad (10)$$

Let $E^*(\cdot \mid W)$, or $E^*(\cdot)$ by omitting $W$, denote the conditional expectation given $W$. Then
$E^*(W^*) = \kappa W$ because $E^*(w_{ij}^*) = E^*(z_{ij}^*) w_{ij} = \kappa w_{ij}$. By noticing $E^*((1 - \kappa)^{-1}(W - W^*)) = W$
and $E^*(\kappa^{-1} W^*) = W$, we use $(1 - \kappa)^{-1}(W - W^*)$ for learning and $\kappa^{-1} W^*$ for testing so
that the cv error is comparable with the fitting error. Thus we define the cv error as

$$\phi_k^{\mathrm{cv}}(W) = E^*\{ \phi_k((1 - \kappa)^{-1}(W - W^*),\ \kappa^{-1} W^*) \mid W \}.$$

The conditional expectation $E^*(\cdot \mid W)$ is actually computed as the average over several $W^*$'s.
In the numerical computation of Section 5, we resample $W^*$ from $W$ with $\kappa = 0.1$ for 30 times.
On the other hand, we resampled $W^*$ from $W$ only once with $\kappa = 0.1$ in Section 2, because
the average may not be necessary for large $N$. For each $W^*$, we compute
$\phi_k((1 - \kappa)^{-1}(W - W^*), \kappa^{-1} W^*)$ by the method of Section 3.3. $A^*$ is computed as the solution of the optimization
problem by replacing $W$ with $(1 - \kappa)^{-1}(W - W^*)$. Then $Y^*$ is computed with rescaling factor
$b_k^* = ((1 - \kappa)^{-1} a^{*kT} X^T (M - M^*) X a^{*k})^{-1/2}$.
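
The procedure above can be sketched end to end for a single component. Everything below is a hedged illustration with dense arrays: the helper names `fit_component`, `matching_error` and `cv_error` are hypothetical, only the regularization of $G$ with an identity matrix is included, and `gamma_M` is assumed positive so that $G$ stays positive definite after links are removed.

```python
import numpy as np

def fit_component(X, W_learn, k, gamma_M):
    """k-th (0-based) component learned from W_learn, rescaled so y^T M y = 1."""
    P = X.shape[1]
    m = W_learn.sum(axis=1)                      # row sums of the learning weights
    G = X.T @ (m[:, None] * X) + gamma_M * np.eye(P)
    H = X.T @ W_learn @ X
    R_inv = np.linalg.inv(np.linalg.cholesky(G).T)   # G^{-1/2}
    S = R_inv.T @ H @ R_inv
    lam, U = np.linalg.eigh((S + S.T) / 2)
    a = R_inv @ U[:, np.argsort(lam)[::-1][k]]
    y = X @ a
    return y / np.sqrt(np.sum(m * y**2))         # weighted rescaling factor b_k

def matching_error(y, W_test):
    """0.5 * sum_ij w_ij (y_i - y_j)^2 with respect to the test weights."""
    diff = y[:, None] - y[None, :]
    return 0.5 * np.sum(W_test * diff**2)

def cv_error(X, W, k, gamma_M, kappa=0.1, n_rep=30, seed=0):
    """Average of phi_k((W - W*)/(1 - kappa), W*/kappa) over link resamples."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    errors = []
    for _ in range(n_rep):
        Z = np.triu(rng.random((N, N)) < kappa)  # symmetric Bernoulli(kappa) indicators
        Z = Z | Z.T
        W_star = np.where(Z, W, 0.0)             # resampled weights as in (10)
        y = fit_component(X, (W - W_star) / (1 - kappa), k, gamma_M)
        errors.append(matching_error(y, W_star / kappa))
    return float(np.mean(errors))
```

Calling `matching_error(fit_component(X, W, k, gamma_M), W)` gives the corresponding fitting error, so the two quantities can be compared on the same scale.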

4.3. Link sampling vs. node sampling

The matching weight matrix is interpreted as the adjacency matrix of a weighted graph (or
network) with nodes of the data vectors. The sampling scheme (1) as well as the resampling scheme
(10) is interpreted as link sampling/resampling. Here we describe another sampling/resampling
scheme. Let $z_i \in \{0, 1\}$, $i = 1, \ldots, N$, be samples from Bernoulli trials with success probability
$\xi > 0$. This is interpreted as node sampling, or equivalently sampling of data vectors, by taking
$z_i$ as the indicator variable of sampling node $i$. Then the observed matching weights may be
defined as

$$w_{ij} = z_i z_j \bar{w}_{ij}, \quad P(z_i = 1) = \xi, \qquad (11)$$

meaning $\bar{w}_{ij}$ is sampled if both node $i$ and node $j$ are sampled together. For computing $\phi_k^{\mathrm{true}}$,
we set $\epsilon = \xi^2$. The resampling scheme of $W - W^*$ should simulate the sampling scheme of $W$.
Therefore, $z_i^* \in \{0, 1\}$, $i = 1, \ldots, N$, are samples from Bernoulli trials with success probability
$1 - \nu > 0$, and the resampling scheme is defined as

$$w_{ij}^* = (1 - z_i^* z_j^*) w_{ij}, \quad P(z_i^* = 1) = 1 - \nu. \qquad (12)$$

For computing $\phi_k^{\mathrm{cv}}$, we set $\kappa = 1 - (1 - \nu)^2 \approx 2\nu$. In the numerical computation of Section 5,
we resample $W^*$ with $\nu = 0.05$ for 30 times.
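
Node sampling (11) and node resampling (12) differ from the link versions only in how the indicator matrix is built. A minimal sketch, with hypothetical function names and dense arrays assumed for illustration:

```python
import numpy as np

def node_sample(W_true, xi, rng):
    """Observed weights w_ij = z_i z_j wbar_ij with z_i ~ Bernoulli(xi), as in (11)."""
    z = (rng.random(W_true.shape[0]) < xi).astype(float)
    return W_true * np.outer(z, z)

def node_resample(W, nu, rng):
    """Test weights w*_ij = (1 - z*_i z*_j) w_ij with z*_i ~ Bernoulli(1 - nu), as in (12)."""
    z_star = (rng.random(W.shape[0]) < 1.0 - nu).astype(float)
    return W * (1.0 - np.outer(z_star, z_star))

def kappa_from_nu(nu):
    """Probability that a given link falls into W*: 1 - (1 - nu)^2."""
    return 1.0 - (1.0 - nu) ** 2
```

A link is kept for learning only if both of its end nodes are kept, so all links of a dropped node move to the test part together.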

The vector $x_i$ does not contribute to the optimization problem if $z_i = 0$ (then $w_{ij} = 0$ for all
$j$) or $z_i^* = 0$ (then $w_{ij} - w_{ij}^* = 0$ for all $j$). Thus the node sampling/resampling may be thought
of as the ordinary sampling/resampling of data vectors, while the link sampling/resampling is
a new approach. These two methods will be compared in the simulation study of Section 5.
Note that the two methods become identical if the number of nonzero elements in each row (or
column) of $\bar{W}$ is not more than one, or equivalently the numbers of links are zero or one for
all vectors. CCA is a typical example: there is a one-to-one correspondence between the two
domains. We expect that the difference between the two methods becomes small for extremely sparse
$W$.

We can further generalize the sampling/resampling scheme. Let us introduce correlations
$\mathrm{corr}(z_{ij}, z_{kl})$ and $\mathrm{corr}(z_{ij}^*, z_{kl}^*)$ in (1) and (10) instead of independent Bernoulli trials. The link
sampling/resampling corresponds to $\mathrm{corr}(z_{ij}, z_{kl}) = \mathrm{corr}(z_{ij}^*, z_{kl}^*) = \delta_{ik} \delta_{jl}$ if indices are confined
to $i \ge j$ and $k \ge l$. The node sampling has additional nonzero correlations if a node is shared
by the two links: $\mathrm{corr}(z_{ij}, z_{ik}) = \xi / (1 + \xi) \approx \xi = \epsilon^{1/2}$. Similarly the node resampling has nonzero
correlations $\mathrm{corr}(z_{ij}^*, z_{ik}^*) = (1 - \nu)/(2 - \nu) \approx 1/2$. In the theoretical argument, we only consider
the simple link sampling/resampling. Incorporating the structural correlations
into equations such as (42) and (47) in Appendix C for generalizing the theoretical
results is left as future work.

Although we have multiplied all the $w_{ij}$ elements by $z_{ij}^*$ or $z_i^* z_j^*$ in the mathematical notation, we
actually look at only the nonzero elements of $W$ in computer software. The resampling algorithms
are implemented very efficiently for sparse $W$.

5. Simulation study 

We have generated twelve datasets for CDMCA of $D = 3$ domains with $P = 140$ and $N = 875$
as shown in Table 2. The details of the data generation are given in Appendix B.2. Considered are
two sampling schemes (link sampling and node sampling), two true matching weights ($\bar{W}_A$ and
$\bar{W}_B$), and three sampling probabilities (0.02, 0.04, 0.08). The matching weights are shown in
Fig. 3 and Fig. 4. The observed $W$ are very sparse. For example, the number of nonzero elements
in the upper triangular part of $W$ is 162 for Experiment 1, and it is 258 for Experiment 5.

$X$ is generated by projecting underlying $5 \times 5$ grid points in $\mathbb{R}^2$ to the higher dimensional
spaces for domains with small additive noise. Scatter plots of $y^k$, $k = 1, 2$, are shown for
Experiments 1 and 5, respectively, in Fig. 5 and Fig. 6. The underlying structure of $5 \times 5$ grid
points is well observed in the scatter plots for $\gamma_M = 0.1$, while the structure is obscured in the
plots for $\gamma_M = 0$. For recovering the hidden structure in $X$ and $\bar{W}$, the value $\gamma_M = 0.1$ looks
better than $\gamma_M = 0$.

In Fig. 5(c) and Fig. 6(c), the curves of matching errors are shown for the components $y^k$,
$k = 1, \ldots, 20$. They are plotted at $\gamma_M = 0, 0.001, 0.01, 0.1, 1$. The smallest two curves ($k = 1, 2$)
are clearly separated from the other 18 curves ($k = 3, \ldots, 20$), suggesting correctly $K = 2$.
Looking at the true matching error $\phi_k^{\mathrm{true}}(W, \bar{W})$, we observe that the true error is minimized
at $\gamma_M = 0.1$ for $k = 1, 2$. The fitting error $\phi_k^{\mathrm{fit}}(W)$, however, underestimates the true error, and
wrongly suggests $\gamma_M = 0$.

For computing the cv error $\phi_k^{\mathrm{cv}}(W)$, we used both the link resampling and the node
resampling. These two cv errors accurately estimate the true error in Experiment 1, where
each node has very few links in $\bar{W}$. In Experiment 5, some nodes have more links, and only
the link resampling estimates the true error very well.

In each of the twelve experiments, we generated 160 datasets of $W$. We computed the
expected values of the matching errors by taking the simulation average of them. We look
at the bias of a matching error divided by its true value:
$(E(\phi_k^{\mathrm{fit}}) - E(\phi_k^{\mathrm{true}}))/E(\phi_k^{\mathrm{true}})$ or
$(E(\phi_k^{\mathrm{cv}}) - E(\phi_k^{\mathrm{true}}))/E(\phi_k^{\mathrm{true}})$ is computed for $k = 1, \ldots, 10$, with $\gamma_M = 0.001, 0.01, 0.1, 1$. These
$10 \times 4 = 40$ values are used for each boxplot in Fig. 7 and Fig. 8. The two figures correspond
to the two types of rescaling factor, respectively, in Section 3.3. We observe that the fitting
error underestimates the true error. The cv error of link resampling is almost unbiased in
Experiments 1 to 6, where $W$ is generated by link sampling. This supports the claim of our theory in Section 6
that the cv error is asymptotically unbiased for estimating the true error. However, it
behaves poorly in Experiments 7 to 12, where $W$ is generated by node sampling. On the other
hand, the cv error of node resampling performs better than link resampling in Experiments 7
to 12, suggesting that an appropriate choice of resampling scheme may be important.

The two rescaling factors behave very similarly overall. Comparing Fig. 7 and Fig. 8, we
observe that the unweighted variance may lead to more stable results than the weighted
variance. This may happen because only a limited number of $y_{1k}, \ldots, y_{Nk}$ are used in the weighted
variance when $W$ is very sparse.

6. Asymptotic theory of the matching errors

6.1. Main results

We investigate the three types of matching errors defined in Section 4. We work on the
asymptotic theory for sufficiently large $N$ under the assumptions given below. Some implications of
these assumptions are mentioned in Section 6.2.

(A1) We consider the limit of $N \to \infty$ for asymptotic expansions. $P$ is a constant or increasing
as $N \to \infty$, but not too large as $P = o(N^{1/2})$. The sampling scheme is (1), and the
resampling scheme is (10). The sampling probability of $W$ from $\bar{W}$ is $\epsilon = o(1)$ but not
too small as $\epsilon^{-1} = o(N)$. The resampling probability of $W^*$ from $W$ is proportional to
$N^{-1}$; $\kappa = O(N^{-1})$ and $\kappa^{-1} = O(N)$.

(A2) The true matching weights are $\bar{w}_{ij} = O(1)$. In general, the asymptotic order of a matrix
or a vector is defined as the maximum order of the elements in this paper, so we write
$\bar{W} = O(1)$. The number of nonzero elements of each row (or each column) is
$\#\{\bar{w}_{ij} \ne 0,\ j = 1, \ldots, N\} = O(\epsilon^{-1})$ for $i = 1, \ldots, N$.

(A3) The elements of $X$ are $x_{ik} = O(N^{-1/2})$. This is only a technical assumption for the
asymptotic argument. In practice, we may assume $x_{ik} = O(1)$ for MCA computation,
and redefine $x^k := x^k / \|x^k\|$ for the theory.





(A4) Let 7 be a generic order parameter for representing the magnitude of regularization terms 
as AG = 0( 7 ) and AH = 0( 7 ). For example, we put Lm = 0(1) and 7 m = 7 - We 
assume 7 = 0(N~ 1 / 2 ). 

(A5) All the P eigenvalues are distinct from the others; A* 7 ^ A j for i ^ j. A is of full rank, 
and assume A = 0(P _1 ). We evaluate fg- only for k = 1,..., J, for some J < P. We 
assume that J is bounded and (A* — Aj ) _1 = 0(1) for i ^ j with i < P, j < J. These 
assumptions apply to all the cases under consideration such as AG = AH — 0 or W 
being replaced by eW. 

Theorem 1. Under the assumptions mentioned above, the following equation holds.

$$E\bigl(\phi_k^{\rm fit}(W) - \phi_k^{\rm true}(W, \bar W)\bigr) = (1 - \epsilon)\, E\bigl(\phi_k^{\rm fit}(W) - \phi_k^{\rm cv}(W)\bigr) + O(N^{-3/2}P^2 + \gamma^3 P^3). \tag{13}$$

This implies

$$E\bigl(\phi_k^{\rm cv}(W)\bigr) = E\bigl(\phi_k^{\rm true}(W, \bar W)\bigr) + O(\epsilon N^{-1}P + N^{-3/2}P^2 + \gamma^3 P^3). \tag{14}$$

Therefore, the cross-validation error is an unbiased estimator of the true error by ignoring the higher-order term of $O(\epsilon N^{-1}P + N^{-3/2}P^2 + \gamma^3 P^3) = o(1)$, which is smaller than $E(\phi_k^{\rm cv}(W)) = O(1)$ for sufficiently large $N$.

Proof. By comparing (27) of Lemma 3 and (30) of Lemma 4, we obtain (13) immediately. Then (14) follows, because $E(\phi_k^{\rm fit}(W) - \phi_k^{\rm true}(W, \bar W)) = O(N^{-1}P)$ as mentioned in Lemma 3. □
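
The step from (13) to (14) can be spelled out as follows. This is only a sketch of the algebra: it writes (13) as $E(\phi_k^{\rm fit} - \phi_k^{\rm true}) = (1-\epsilon)E(\phi_k^{\rm fit} - \phi_k^{\rm cv}) + R$, where $R = O(N^{-3/2}P^2 + \gamma^3P^3)$ is shorthand introduced here (not in the original) for the remainder, and it uses $\epsilon = o(1)$ from (A1).

```latex
\begin{align*}
E(\phi_k^{\rm cv})
 &= E(\phi_k^{\rm fit}) - E(\phi_k^{\rm fit} - \phi_k^{\rm cv}) \\
 &= E(\phi_k^{\rm fit}) - (1-\epsilon)^{-1}\bigl\{E(\phi_k^{\rm fit} - \phi_k^{\rm true}) - R\bigr\} \\
 &= E(\phi_k^{\rm true}) - \frac{\epsilon}{1-\epsilon}\,E(\phi_k^{\rm fit} - \phi_k^{\rm true}) + (1-\epsilon)^{-1}R \\
 &= E(\phi_k^{\rm true}) + O(\epsilon N^{-1}P) + O(N^{-3/2}P^2 + \gamma^3 P^3).
\end{align*}
```

The term $O(\epsilon N^{-1}P)$ in (14) is thus the fitting bias $O(N^{-1}P)$ of Lemma 3 multiplied by $\epsilon/(1-\epsilon)$.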

We use $N$, $P$, $\gamma$, $\epsilon$ in expressions of asymptotic orders. Higher order terms can be simplified by substituting $P = o(N^{1/2})$ and $\gamma = O(N^{-1/2})$. For example, $O(N^{-3/2}P^2 + \gamma^3P^3) = o(N^{-1/2} + 1) = o(1)$ in (13). However, we attempt to leave the terms with $P$ and $\gamma$ for finer evaluation.

The theorem justifies the link resampling for estimating the true matching error under the 
link sampling scheme. Now we have theoretically confirmed our observation that the cross- 
validation is nearly unbiased in the numerical examples of Sections 2 and 5. Although the 
fitting error underestimates the true error in the numerical examples, it has not been clear that 
the bias is negative in some sense from the expression of the bias given in Lemma 3. 

In the following subsections, we will discuss lemmas used for the proof of Theorem 1. In 
Section 6.2, we look at the assumptions of the theorem. In Section 6.3, the solution of the 
optimization problem and the matching error are expressed in terms of small perturbation 
of the regularization matrices. Lemma 1 gives the asymptotic expansions of the eigenvalues $\lambda_1,\ldots,\lambda_P$ and the linear transformation matrix $A$ in terms of $\Delta G$ and $\Delta H$. Lemma 2 gives the asymptotic expansion of the matching error $\phi_k(W, \tilde W)$ in terms of $\Delta G$ and $\Delta H$. Using these results, the bias for estimating the true error is discussed in Section 6.4. Lemma 3 gives the asymptotic expansion of the bias of the fitting error, and Lemma 4 shows that the cross-validation adjusts the bias. All the proofs of lemmas are given in Appendix C.

6.2. Some technical notes 

We put $K = P$ for the definitions of matrices in Section 6 without losing generality. For characterizing the solution $A \in \mathbb{R}^{P\times P}$ of the optimization problem in Section 3.3, (6) and (8) are now, with $\Lambda = {\rm diag}(\lambda_1,\ldots,\lambda_P)$,

$$A^T G A = I_P, \quad A^T H A = \Lambda. \tag{15}$$

We can do so, because the solution $a^k$ and the eigenvalue $\lambda_k$ for the $k$-th component do not change for each $1 \le k \le K$ when the value of $K$ changes. This property also holds for the matching errors of the $k$-th component. Therefore a result shown for some $1 \le k \le P$ with $K = P$ holds true for the same $k$ with any $K \le P$, meaning that we can put $K = P$ in Section 6. In our asymptotic theory, however, we would like to confine $k$ to a finite value. So we restrict our attention to $1 \le k \le J$ in the assumption (A5).

The assumption of $N$ and $P$ in (A1) covers many applications in practice. The theory may 
work for the case that N is hundreds and P is dozens, or for a more recent case that N is 
millions and P is hundreds. 

Asymptotic properties of $W$ follow from (A1) and (A2). Since the elements of $W$ are sampled from $\bar W$, we have $W = O(1)$ and $M = O(1)$. The number of nonzero elements for each row (or each column) is

$$\#\{w_{ij} \neq 0,\ j = 1,\ldots,N\} = O(1), \quad i = 1,\ldots,N, \tag{16}$$

and the total number of nonzero elements is $\#\{w_{ij} \neq 0,\ i,j = 1,\ldots,N\} = O(N)$. Thus $W$ is assumed to be a very sparse matrix. Examples of such sparse matrices are image-tag links in Flickr, or more typically friend links in Facebook. Although the label domains of the MNIST dataset do not satisfy (16), having many links to images, our method still worked very well.
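
To make the sparsity implied by (A1), (A2), and (16) concrete, the following minimal sketch simulates link sampling on a synthetic 0/1 matrix. The sizes `n` and `eps`, the way `W_true` is generated, and the helper name `sample_links` are illustrative assumptions, not taken from the experiments of the paper.

```python
import numpy as np

def sample_links(W_true, eps, rng):
    """Keep each link (l <= m) of W_true independently with probability eps."""
    n = W_true.shape[0]
    iu = np.triu_indices(n)                  # one indicator per unordered pair
    keep = rng.random(iu[0].size) < eps
    W = np.zeros_like(W_true)
    W[iu] = W_true[iu] * keep
    return W + np.triu(W, 1).T               # mirror to restore symmetry

rng = np.random.default_rng(0)
n, eps = 400, 0.05
# Synthetic true weights with about 1/eps nonzeros per row, as in (A2).
upper = np.triu(rng.random((n, n)) < 1.0 / (eps * n), 1).astype(float)
W_true = upper + upper.T
W = sample_links(W_true, eps, rng)

row_nnz_true = (W_true != 0).sum(axis=1)     # O(1/eps) per row
row_nnz = (W != 0).sum(axis=1)               # O(1) per row, as in (16)
print(row_nnz_true.mean(), row_nnz.mean())
```

In this toy setting each row of `W_true` has roughly $\epsilon^{-1}$ nonzero elements, while each row of the sampled `W` keeps about one.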

The assumption of $\kappa$ in (A1) implies that the number of nonzero elements in $W^*$ is $O(1)$. 
Similarly to the leave-one-out cross-validation of data vectors, we resample very few links in 
our cross-validation. 

From (A3), $\sum_{i=1}^N m_i x_{ik}x_{il} = O(N(N^{-1/2})^2) = O(1)$, and thus $X^TMX = O(1)$, and $G = O(1)$. Also, $\sum_{i=1}^N\sum_{j=1}^N w_{ij}x_{ik}x_{jl} = \sum_{(i,j):\, w_{ij}\neq 0} w_{ij}x_{ik}x_{jl} = O(N(N^{-1/2})^2) = O(1)$, and thus $X^TWX = O(1)$, and $H = O(1)$. From (A5), $\sum_{i=1}^P\sum_{j=1}^P (G)_{ij}a_{ik}a_{jk} = O(P^2(P^{-1})^2) = O(1)$, which is necessary for $A^TGA = I_P$. Then $Y = XA = O(PN^{-1/2}P^{-1}) = O(N^{-1/2})$.

The assumptions on the eigenvalues described in (A5) may be difficult to hold in practice. 
In fact, there are many zero eigenvalues in the examples of Section 5; 60 zeros, 40 positives 
($K^+ = 40$), and 40 negatives in the $P = 140$ eigenvalues. Looking at Experiment 1, however, we observed that $\phi_k^{\rm cv} \approx \phi_k^{\rm true}$ holds well and $E(\phi_k^{\rm cv}) = E(\phi_k^{\rm true})$ holds very accurately for $k = 1,\ldots,40$ ($\lambda_k > 0$) when $\gamma_M > 0$. The eigenvalues for $\gamma_M = 0.1$ are $\lambda_1 = 0.988$, $\lambda_2 = 0.978$, $\lambda_3 = 0.562$, $\lambda_4 = 0.509$, $\lambda_5 = 0.502, \ldots$, where $\lambda_1 - \lambda_2$ looks very small, but it did not cause any problem. On the other hand, the eigenvalues for $\gamma_M = 0$ are $\lambda_1 = 0.999$, $\lambda_2 = 0.997$, $\lambda_3 = 0.973$, $\lambda_4 = 0.971$, $\lambda_5 = 0.960, \ldots$, where some $\lambda_k$ are very close to each other. This might be a cause for the deviation of $\phi_k^{\rm cv}$ from $\phi_k^{\rm true}$ when $\gamma_M = 0$.

6.3. Small change in $A$, $\Lambda$, and the matching error 

Here we show how $A$, $\Lambda$ and $\phi_k(W, \tilde W)$ depend on $\Delta G$ and $\Delta H$. Recall that terms with hat are for $\Delta G = \Delta H = 0$ as defined in Section 3.2; $\hat G = X^TMX$, $\hat H = X^TWX$, $\hat Y = X\hat A$. The optimization problem is characterized as $\hat A^T\hat G\hat A = I_P$, $\hat A^T\hat H\hat A = \hat\Lambda$, where $\hat\Lambda = {\rm diag}(\hat\lambda_1,\ldots,\hat\lambda_P)$ is the diagonal matrix of the eigenvalues $\hat\lambda_1,\ldots,\hat\lambda_P$. Then $A$ and $\Lambda$ are defined in Section 3.3 for $G = \hat G + \Delta G$ and $H = \hat H + \Delta H$. They satisfy (15). The asymptotic expansions for $A$ and $\Lambda$ will be given in Lemma 1 using

$$g = \hat A^T\Delta G\hat A, \quad h = \hat A^T\Delta H\hat A \in \mathbb{R}^{P\times P}.$$

For proving the lemma in Section C.1, we will solve (15) under the small perturbation of $g$ and $h$.

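Lemma 1. Let $\Delta\Lambda = \Lambda - \hat\Lambda$ with elements $\Delta\lambda_i = \lambda_i - \hat\lambda_i$, $i = 1,\ldots,P$. Define $C \in \mathbb{R}^{P\times P}$ as $C = \hat A^{-1}A - I_P$ so that $A = \hat A(I_P + C)$. Here $\hat A^{-1}$ exists, since we assumed that $\hat A$ is of full rank in (A5). We assume $g = O(\gamma)$ and $h = O(\gamma)$.

Then the elements of $\Delta\Lambda = O(\gamma)$ are, for $i = 1,\ldots,P$,

$$\Delta\lambda_i = -(g_{ii}\hat\lambda_i - h_{ii}) + \delta\lambda_i \tag{17}$$

with $\delta\lambda_i = O(\gamma^2P)$ defined for $i \le J$ as

$$\delta\lambda_i = g_{ii}(g_{ii}\hat\lambda_i - h_{ii}) - \sum_{j\neq i}(\hat\lambda_j - \hat\lambda_i)^{-1}(g_{ji}\hat\lambda_i - h_{ji})^2 + O(\gamma^3P^2), \tag{18}$$

where $\sum_{j\neq i}$ is the summation over $j = 1,\ldots,P$, except for $j = i$.

The elements of the diagonal part of $C = O(\gamma)$ are, for $i = 1,\ldots,P$,

$$c_{ii} = -\tfrac{1}{2}g_{ii} + \delta c_{ii} \tag{19}$$

with $\delta c_{ii} = O(\gamma^2P)$ defined for $i \le J$ as

$$\delta c_{ii} = \tfrac{3}{8}(g_{ii})^2 - \sum_{j\neq i}(\hat\lambda_j - \hat\lambda_i)^{-1}(g_{ji}\hat\lambda_i - h_{ji})g_{ji} - \tfrac{1}{2}\sum_{j\neq i}(\hat\lambda_j - \hat\lambda_i)^{-2}(g_{ji}\hat\lambda_i - h_{ji})^2 + O(\gamma^3P^2), \tag{20}$$

and the elements of the off-diagonal ($i \neq j$) part of $C$ are, for either $i \le J$, $j \le P$ or $i \le P$, $j \le J$, i.e., one of $i$ and $j$ is not greater than $J$,

$$c_{ij} = (\hat\lambda_i - \hat\lambda_j)^{-1}(g_{ij}\hat\lambda_j - h_{ij}) + \delta c_{ij} \tag{21}$$

with $\delta c_{ij} = O(\gamma^2P)$ defined for $i \le P$, $j \le J$ as

$$\begin{aligned}
\delta c_{ij} = {} & -\tfrac{1}{2}(\hat\lambda_i - \hat\lambda_j)^{-1}(g_{ij}\hat\lambda_j - h_{ij})g_{jj} - (\hat\lambda_i - \hat\lambda_j)^{-2}(g_{ji}\hat\lambda_i - h_{ji})(g_{jj}\hat\lambda_j - h_{jj}) \\
& + (\hat\lambda_i - \hat\lambda_j)^{-1}\sum_{k\neq j}(\hat\lambda_k - \hat\lambda_j)^{-1}(g_{ki}\hat\lambda_j - h_{ki})(g_{kj}\hat\lambda_j - h_{kj}) + O(\gamma^3P^2).
\end{aligned} \tag{22}$$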

The asymptotic expansion of $\phi_k(W, \tilde W)$ is given in Lemma 2. For proving the lemma in 
Section C.2, the matching error is first expressed by C, and then the result of Lemma 1 is used 
for simplifying the expression. 

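Lemma 2. Let $\hat S = \hat A^TX^T(\tilde M - \tilde W)X\hat A = \hat Y^T(\tilde M - \tilde W)\hat Y \in \mathbb{R}^{P\times P}$. We assume $\hat S = O(1)$, which holds for, say, $\tilde W = W$ and $\tilde W = \epsilon\bar W$. We assume $g = O(\gamma)$ and $h = O(\gamma)$. Then the matching error of Section 3.1 is expressed asymptotically as

$$\begin{aligned}
\phi_k(W, \tilde W) = {} & \hat s_{kk} + \sum_{i\neq k} 2(\hat\lambda_i - \hat\lambda_k)^{-1}\hat s_{ik}(g_{ik}\hat\lambda_k - h_{ik}) \\
& - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-2}\bigl\{\hat s_{kk}(g_{ik}\hat\lambda_k - h_{ik})^2 + 2\hat s_{ik}(g_{ki}\hat\lambda_i - h_{ki})(g_{kk}\hat\lambda_k - h_{kk})\bigr\} \\
& + \sum_{i\neq k}\sum_{j\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(\hat\lambda_j - \hat\lambda_k)^{-1}\bigl\{2\hat s_{ik}(g_{ij}\hat\lambda_k - h_{ij})(g_{jk}\hat\lambda_k - h_{jk}) \\
& \qquad + \hat s_{ij}(g_{ik}\hat\lambda_k - h_{ik})(g_{jk}\hat\lambda_k - h_{jk})\bigr\} + O(\gamma^3P^3).
\end{aligned} \tag{23}$$

Let us further assume $\tilde W = W$. Then $\hat S = I_P - \hat\Lambda$ with elements $\hat s_{ij} = \delta_{ij}(1 - \hat\lambda_j)$. Substituting it into (23), we get an expression of $\phi_k^{\rm fit}(W) = \phi_k(W, W)$ as

$$\phi_k^{\rm fit}(W) = 1 - \hat\lambda_k - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(g_{ik}\hat\lambda_k - h_{ik})^2 + O(\gamma^3P^3). \tag{24}$$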
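
The identity $\hat S = I_P - \hat\Lambda$ for $\tilde W = W$, and hence the leading term $1 - \hat\lambda_k$ of (24), can be verified on a toy problem without regularization. This is a minimal sketch: the random graph, the data matrix `X`, and the helper `gen_eig` are assumptions made for illustration only.

```python
import numpy as np

def gen_eig(G, H):
    """Return A, lam with A^T G A = I and A^T H A = diag(lam), lam descending."""
    L = np.linalg.cholesky(G)
    Linv = np.linalg.inv(L)
    lam, U = np.linalg.eigh(Linv @ H @ Linv.T)
    order = np.argsort(lam)[::-1]
    return Linv.T @ U[:, order], lam[order]

rng = np.random.default_rng(2)
N, P = 30, 4
upper = np.triu(rng.random((N, N)) < 0.3, 1).astype(float)
upper = np.maximum(upper, np.eye(N, k=1))   # chain links keep every row sum positive
W = upper + upper.T                          # symmetric 0/1 matching weights
M = np.diag(W.sum(axis=1))                   # row sums of W on the diagonal
X = rng.standard_normal((N, P))

G_hat = X.T @ M @ X
H_hat = X.T @ W @ X
A_hat, lam_hat = gen_eig(G_hat, H_hat)
Y_hat = X @ A_hat

S_hat = Y_hat.T @ (M - W) @ Y_hat            # S-hat of Lemma 2 with W-tilde = W

k = 0
diff = Y_hat[:, None, k] - Y_hat[None, :, k]
phi_fit = 0.5 * np.sum(W * diff**2)          # (1/2) sum_ij w_ij (y_ik - y_jk)^2
print(phi_fit, 1 - lam_hat[k])
```

Because $\hat A^T\hat G\hat A = I_P$, the rescaling factor $b_k$ equals one here, so the directly computed matching error coincides with $1 - \hat\lambda_k$.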

It follows from (A4) and (A5) that the magnitudes of the regularization terms are expressed as $g = O(\gamma)$ and $h = O(\gamma)$. This is mentioned as an assumption in Lemma 1 and Lemma 2 for the sake of clarity. Although these two lemmas are shown for the regularization terms, they hold generally for any small perturbation other than the regularization terms. Later, in the proofs of Lemma 3 and Lemma 4, we will apply Lemma 1 and Lemma 2 to perturbation of other types with $\gamma = O(N^{-1/2})$ or $\gamma = O(N^{-1})$.


6.4. Bias of the fitting error 

Let us consider the optimization problem with respect to $\epsilon\bar W$. We define $\bar A$ and $\bar\Lambda$ as the solution of $\bar A^T\bar G\bar A = I_P$ and $\bar A^T\bar H\bar A = \bar\Lambda$ with $\bar G = \epsilon X^T\bar MX$ and $\bar H = \epsilon X^T\bar WX$. The eigenvalues are $\bar\lambda_1,\ldots,\bar\lambda_P$ and the matrix is $\bar\Lambda = {\rm diag}(\bar\lambda_1,\ldots,\bar\lambda_P)$. We also define $\bar Y = X\bar A$. These quantities correspond to those with hat, but $W$ is replaced by $\epsilon\bar W$. We then define matrices representing change from $\epsilon\bar W$ to $W$: $\Delta W = W - \epsilon\bar W$, $\Delta M = M - \epsilon\bar M$, $\Delta\hat G = X^T\Delta MX$ and $\Delta\hat H = X^T\Delta WX$.

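In Section 6.3, $g$ and $h$ are used for describing change with respect to the regularization terms $\Delta G$ and $\Delta H$. Quite similarly,

$$\hat g = \bar A^T\Delta\hat G\bar A = \bar Y^T\Delta M\bar Y, \quad \hat h = \bar A^T\Delta\hat H\bar A = \bar Y^T\Delta W\bar Y$$

will be used for describing change with respect to $\Delta\hat G$ and $\Delta\hat H$, namely, change from $\epsilon\bar W$ to $W$. The elements of $\hat g = (\hat g_{ij})$ and $\hat h = (\hat h_{ij})$ are

$$\hat g_{ij} = (\bar y^i)^T\Delta M\bar y^j = \sum_{l>m}(\bar y_{li}\bar y_{lj} + \bar y_{mi}\bar y_{mj})\Delta w_{lm} + \sum_{l=1}^N \bar y_{li}\bar y_{lj}\Delta w_{ll},$$

$$\hat h_{ij} = (\bar y^i)^T\Delta W\bar y^j = \sum_{l>m}(\bar y_{li}\bar y_{mj} + \bar y_{mi}\bar y_{lj})\Delta w_{lm} + \sum_{l=1}^N \bar y_{li}\bar y_{lj}\Delta w_{ll},$$

and they will be denoted as

$$\hat g_{ij} = \sum_{l\ge m}\mathcal{G}[\bar Y]_{lm}^{ij}\,\Delta w_{lm}, \quad \hat h_{ij} = \sum_{l\ge m}\mathcal{H}[\bar Y]_{lm}^{ij}\,\Delta w_{lm}, \tag{25}$$

where $\sum_{l>m} = \sum_{l=2}^N\sum_{m=1}^{l-1}$ and $\sum_{l\ge m} = \sum_{l=1}^N\sum_{m=1}^{l}$.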
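
The coefficient arrays in (25) can be read off from the two sums above. The following minimal sketch builds them for a generic matrix `Y` and checks them against the direct quadratic forms; the function names `G_coef` and `H_coef` and the random test data are assumptions made for illustration.

```python
import numpy as np

def G_coef(Y, i, j):
    """Coefficients of dw_lm (l >= m) in (y^i)^T dM y^j, stored in the lower triangle."""
    prod = Y[:, i] * Y[:, j]
    C = np.tril(prod[:, None] + prod[None, :], -1)   # l > m: y_li y_lj + y_mi y_mj
    return C + np.diag(prod)                          # l = m: y_li y_lj

def H_coef(Y, i, j):
    """Coefficients of dw_lm (l >= m) in (y^i)^T dW y^j, stored in the lower triangle."""
    yi, yj = Y[:, i], Y[:, j]
    C = np.tril(np.outer(yi, yj) + np.outer(yj, yi), -1)  # l > m: y_li y_mj + y_mi y_lj
    return C + np.diag(yi * yj)                            # l = m: y_li y_lj

rng = np.random.default_rng(6)
N, P = 7, 3
Y = rng.standard_normal((N, P))
dW = rng.standard_normal((N, N))
dW = (dW + dW.T) / 2                         # symmetric change of weights
dM = np.diag(dW.sum(axis=1))                 # corresponding change of row sums
low = np.tril(dW)                            # entries with l >= m

i, j = 0, 2
g_direct = Y[:, i] @ dM @ Y[:, j]
h_direct = Y[:, i] @ dW @ Y[:, j]
g_sum = np.sum(G_coef(Y, i, j) * low)
h_sum = np.sum(H_coef(Y, i, j) * low)
print(g_direct - g_sum, h_direct - h_sum)
```

Both differences vanish up to rounding, which confirms that each unordered pair $(l, m)$ contributes a single term proportional to $\Delta w_{lm}$.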

We are now ready to consider the bias of the fitting error. The difference of the fitting error from the true error is

$$\phi_k^{\rm fit}(W) - \phi_k^{\rm true}(W, \bar W) = \phi_k(W, \Delta W). \tag{26}$$

The asymptotic expansion of (26) is given by Lemma 2 with $\tilde W = \Delta W$. This expression will be rewritten by $\hat g$ and $\hat h$ in Section C.3 for proving the following lemma. For rewriting $\hat A$ and $\hat\Lambda$ in terms of $\bar A$ and $\bar\Lambda$, Lemma 1 will be used there.

Lemma 3. Bias of the fitting error for estimating the true error is expressed asymptotically as

$$E\bigl(\phi_k^{\rm fit}(W) - \phi_k^{\rm true}(W, \bar W)\bigr) = {\rm bias}_k + O(N^{-3/2}P^2 + \gamma^3P^3), \tag{27}$$

where ${\rm bias}_k = O(N^{-1}P)$ is defined as

$${\rm bias}_k = E\Bigl[-(\hat g_{kk} - \hat h_{kk})\hat g_{kk} + \sum_{j\neq k} 2(\bar\lambda_j - \bar\lambda_k)^{-1}(\hat g_{jk} - \hat h_{jk})(\hat g_{jk}\bar\lambda_k - \hat h_{jk})\Bigr]$$

using the elements of $\hat g$, $\hat h$ and $\bar\Lambda$ mentioned above. ${\rm bias}_k$ can be expressed as

$$\begin{aligned}
{\rm bias}_k = \epsilon(1-\epsilon)\sum_{l\ge m}\bar w_{lm}^2\Bigl[ & -(\mathcal{G}[\bar Y]_{lm}^{kk} - \mathcal{H}[\bar Y]_{lm}^{kk})\,\mathcal{G}[\bar Y]_{lm}^{kk} \\
& + \sum_{j\neq k} 2(\bar\lambda_j - \bar\lambda_k)^{-1}(\mathcal{G}[\bar Y]_{lm}^{jk} - \mathcal{H}[\bar Y]_{lm}^{jk})(\mathcal{G}[\bar Y]_{lm}^{jk}\bar\lambda_k - \mathcal{H}[\bar Y]_{lm}^{jk})\Bigr].
\end{aligned} \tag{28}$$

We also have $\hat g = O(N^{-1/2})$ and $\hat h = O(N^{-1/2})$.

In the following lemma, we will show that the bias of the fitting error is adjusted by the cross-validation. For proving the lemma in Section C.4, we will first give the asymptotic expansion of $\phi_k((1-\kappa)^{-1}(W - W^*), \kappa^{-1}W^*)$ using Lemma 2. The expression will be rewritten using the change from $\hat A$ and $\hat\Lambda$ to those with respect to $(1-\kappa)^{-1}(W - W^*)$. The asymptotic expansion of $\phi_k^{\rm cv}(W)$ will be obtained by taking the expectation with respect to (10). Finally, we will take the expectation with respect to (1).

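Lemma 4. The difference of the fitting error from the cross-validation error is expressed asymptotically as

$$\begin{aligned}
\phi_k^{\rm fit}(W) - \phi_k^{\rm cv}(W) = \sum_{l\ge m} w_{lm}^2\Bigl[ & -(\mathcal{G}[\hat Y]_{lm}^{kk} - \mathcal{H}[\hat Y]_{lm}^{kk})\,\mathcal{G}[\hat Y]_{lm}^{kk} \\
& + \sum_{j\neq k} 2(\hat\lambda_j - \hat\lambda_k)^{-1}(\mathcal{G}[\hat Y]_{lm}^{jk} - \mathcal{H}[\hat Y]_{lm}^{jk})(\mathcal{G}[\hat Y]_{lm}^{jk}\hat\lambda_k - \mathcal{H}[\hat Y]_{lm}^{jk})\Bigr] \\
& + O(N^{-2}P^2 + N^{-1}\gamma P^2 + \gamma^3P^3),
\end{aligned} \tag{29}$$

and its expected value is

$$E\bigl(\phi_k^{\rm fit}(W) - \phi_k^{\rm cv}(W)\bigr) = (1-\epsilon)^{-1}\,{\rm bias}_k + O(N^{-3/2}P^2 + \gamma^3P^3). \tag{30}$$

Appendix A: Cross-domain matching correlation analysis 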


A.1. A simple coding for cross-domain matching 

Here we explain how CDMCA is converted to MCA. Let $x_i^{(d)} \in \mathbb{R}^{p_d}$, $i = 1,\ldots,n_d$, denote the data vectors of domain $d$. Each $x_i^{(d)}$ is coded as an augmented vector $\tilde x_i^{(d)} \in \mathbb{R}^P$ defined as

$$(\tilde x_i^{(d)})^T = \bigl((0_{p_1})^T, \ldots, (0_{p_{d-1}})^T, (x_i^{(d)})^T, (0_{p_{d+1}})^T, \ldots, (0_{p_D})^T\bigr). \tag{31}$$

Here, $0_p \in \mathbb{R}^p$ is the vector with zero elements. This is a sparse coding (Olshausen and Field, 2004) in the sense that nonzero elements for domains do not overlap each other. All the $N$ vectors of $D$ domains are now represented as points in the same $\mathbb{R}^P$. We take these $N$ vectors as $x_i \in \mathbb{R}^P$, $i = 1,\ldots,N$. Then CDMCA reduces to MCA. The data matrix is expressed as

$$X^T = \bigl(\tilde x_1^{(1)}, \ldots, \tilde x_{n_1}^{(1)}, \ldots, \tilde x_1^{(D)}, \ldots, \tilde x_{n_D}^{(D)}\bigr).$$

Let $X^{(d)} \in \mathbb{R}^{n_d\times p_d}$ be the data matrix of domain $d$ defined as $(X^{(d)})^T = (x_1^{(d)}, \ldots, x_{n_d}^{(d)})$. The data matrix of the augmented vectors is now expressed as $X = {\rm Diag}(X^{(1)}, \ldots, X^{(D)})$, where ${\rm Diag}()$ indicates a block diagonal matrix.

Let us consider partitions of $A$ and $Y$ as $A^T = ((A^{(1)})^T, \ldots, (A^{(D)})^T)$ and $Y^T = ((Y^{(1)})^T, \ldots, (Y^{(D)})^T)$ with $A^{(d)} \in \mathbb{R}^{p_d\times K}$ and $Y^{(d)} \in \mathbb{R}^{n_d\times K}$. Then the linear transformation of MCA, $Y = XA$, is expressed as

$$Y^{(d)} = X^{(d)}A^{(d)}, \quad d = 1,\ldots,D,$$

which are the linear transformations of CDMCA. The matching weight matrix between domains $d$ and $e$ is $W^{(de)} = (w_{ij}^{(de)}) \in \mathbb{R}^{n_d\times n_e}$ for $d, e = 1,\ldots,D$. They are placed in a $D\times D$ array to define $W = (W^{(de)}) \in \mathbb{R}^{N\times N}$. Then the matching error of MCA is expressed as

$$\frac{1}{2}\sum_{d=1}^D\sum_{e=1}^D\sum_{i=1}^{n_d}\sum_{j=1}^{n_e} w_{ij}^{(de)}\,\|y_i^{(d)} - y_j^{(e)}\|^2,$$

which is the matching error of CDMCA.

Notice $M = {\rm Diag}(M^{(1)}, \ldots, M^{(D)})$ with $M^{(d)} = {\rm diag}((W^{(d1)}, \ldots, W^{(dD)})1_N)$, and so $X^TMX = {\rm Diag}((X^{(1)})^TM^{(1)}X^{(1)}, \ldots, (X^{(D)})^TM^{(D)}X^{(D)})$ is computed efficiently by looking at only the block diagonal parts.
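
The coding can be sketched in a few lines. The example below is a minimal illustration with three small synthetic domains; the sizes, the random matrices, and the helper `block_diag` are assumptions chosen for the sketch, not the settings of the experiments.

```python
import numpy as np

def block_diag(blocks):
    """Place the given 2-D arrays on the diagonal of a zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

rng = np.random.default_rng(3)
n, p, K = [4, 3, 5], [2, 3, 1], 2                  # n_d, p_d, and output dimension
Xd = [rng.standard_normal((n[d], p[d])) for d in range(3)]
Ad = [rng.standard_normal((p[d], K)) for d in range(3)]

X = block_diag(Xd)                                 # rows are the augmented vectors
A = np.vstack(Ad)                                  # A^T = ((A^(1))^T, ..., (A^(D))^T)
Y = X @ A

starts = np.cumsum([0] + n)
Yd = [Y[starts[d]:starts[d + 1]] for d in range(3)]  # partition of Y by domain

W = np.triu(rng.random((sum(n), sum(n))), 1)
W = W + W.T                                        # symmetric matching weights
M = np.diag(W.sum(axis=1))
XtMX = X.T @ M @ X
XtMX_blocks = block_diag([Xd[d].T @ M[starts[d]:starts[d + 1], starts[d]:starts[d + 1]] @ Xd[d]
                          for d in range(3)])
```

Each block `Yd[d]` equals `Xd[d] @ Ad[d]`, and `XtMX` agrees with the block diagonal matrix assembled from the per-domain products, so only the diagonal blocks need to be computed.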

Let us consider a simple case with $n_1 = \cdots = n_D = n$, $N = nD$, $W^{(de)} = c_{de}I_n$ using a coefficient $c_{de} \ge 0$ for all $d, e = 1,\ldots,D$. Then CDMCA reduces to a version of MCCA, where associations between sets of variables are specified by the coefficients $c_{de}$ (Tenenhaus and Tenenhaus, 2011). Another version of MCCA with all $c_{de} = 1$ for $d \neq e$ is discussed extensively in Takane, Hwang and Abdi (2008). For the simplest case of $D = 2$ with $c_{12} = 1$, $c_{11} = c_{22} = 0$, CDMCA reduces to CCA with $W = \begin{pmatrix} 0 & I_n \\ I_n & 0 \end{pmatrix}$.

A.2. Auto-associative correlation matrix memory 


Let us consider $\Omega \in \mathbb{R}^{P\times P}$ defined by

$$\Omega = \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N w_{ij}(x_i + x_j)(x_i + x_j)^T.$$

This is the correlation matrix of $x_i + x_j$ weighted by $w_{ij}$. Since $\Omega = X^TMX + X^TWX$, we have $\Omega + \Delta G + \Delta H = G + H$, and then MCA of Section 3.3 is equivalent to maximizing ${\rm tr}(A^T(\Omega + \Delta G + \Delta H)A)$ with respect to $A \in \mathbb{R}^{P\times K}$ subject to (6). The role of $H$ is now replaced by $\Omega + \Delta G + \Delta H$. Thus MCA is interpreted as dimensionality reduction of the correlation matrix $\Omega$ with regularization term $\Delta G + \Delta H$.
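
The identity $\Omega = X^TMX + X^TWX$ follows by expanding the outer product, and it can also be confirmed numerically. The snippet below is a small sketch with arbitrary synthetic `X` and `W` (both assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 6, 3
X = rng.standard_normal((N, P))
W = np.triu(rng.random((N, N)), 1)
W = W + W.T                                  # symmetric matching weights
M = np.diag(W.sum(axis=1))                   # row sums on the diagonal

# Omega as the weighted correlation matrix of x_i + x_j.
Omega = np.zeros((P, P))
for i in range(N):
    for j in range(N):
        s = X[i] + X[j]
        Omega += 0.5 * W[i, j] * np.outer(s, s)

rhs = X.T @ M @ X + X.T @ W @ X
print(np.max(np.abs(Omega - rhs)))
```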

For CDMCA, the correlation matrix becomes

$$\Omega = \frac{1}{2}\sum_{d=1}^D\sum_{e=1}^D\sum_{i=1}^{n_d}\sum_{j=1}^{n_e} w_{ij}^{(de)}\,(\tilde x_i^{(d)} + \tilde x_j^{(e)})(\tilde x_i^{(d)} + \tilde x_j^{(e)})^T.$$

This is the correlation matrix of the input pattern of a pair of data vectors

$$\tilde x_i^{(d)} + \tilde x_j^{(e)} \tag{32}$$

weighted by $w_{ij}^{(de)}$. Interestingly, the same correlation matrix is found in one of the classical neural network models. Any part of the memorized vector can be used as a key for recalling the whole vector in the auto-associative correlation matrix memory (Kohonen, 1972), also known as Associatron (Nakano, 1972). This associative memory may recall $\tilde x_i^{(d)} + \tilde x_j^{(e)}$ for input key either $\tilde x_i^{(d)}$ or $\tilde x_j^{(e)}$ if $w_{ij}^{(de)} > 0$. In particular, the representation (32) of a pair of data vectors is equivalent to eq. (14) of Nakano (1972). Thus CDMCA is interpreted as dimensionality reduction of the auto-associative correlation matrix memory for pairs of data vectors.


Appendix B: Experimental details 
B.1. MNIST handwritten digits 

The MNIST database of handwritten digits (LeCun et al., 1998) has a training set of 60,000 images, and a test set of 10,000 images. Each image has 28 x 28 = 784 pixels of 256 gray levels. We prepared a dataset of cross-domain matching with three domains for illustration purposes.

The data matrix $X$ is specified as follows. The first domain ($d = 1$) is for the handwritten digit images of $n_1 = 60{,}000$. Each image is coded as a vector of $p_1 = 2784$ dimensions by concatenating an extra 2000 dimensional vector. We have chosen randomly 2000 pairs of pixels x[i,j], x[k,l] with $|i - k| \le 5$, $|j - l| \le 5$ in advance, and compute the product x[i,j] * x[k,l] for each image. The second domain ($d = 2$) is for digit labels of $n_2 = 10$. They are zero, one, two, ..., nine. Each label is coded as a random vector of $p_2 = 100$ dimensions with each element generated independently from the standard normal distribution $N(0,1)$. The third domain ($d = 3$) is for attribute labels even, odd, and prime ($n_3 = 3$). Each label is coded as a random vector of $p_3 = 50$ dimensions with each element generated independently from $N(0,1)$. The total number of vectors is $N = n_1 + n_2 + n_3 = 60013$, and the dimension of the augmented vector is $P = p_1 + p_2 + p_3 = 2934$.
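
The image coding of the first domain can be sketched as follows. This is only an illustration under stated assumptions: the rejection sampling in `choose_pixel_pairs`, the possibility that a pixel is paired with itself, the absence of any pixel rescaling, and the random stand-in image are choices made here, since the text does not specify these details.

```python
import numpy as np

def choose_pixel_pairs(n_pairs, size, max_offset, rng):
    """Draw pixel pairs (i, j), (k, l) with |i - k| <= max_offset and |j - l| <= max_offset."""
    pairs = []
    while len(pairs) < n_pairs:
        i, j = rng.integers(0, size, size=2)
        di, dj = rng.integers(-max_offset, max_offset + 1, size=2)
        k, l = i + di, j + dj
        if 0 <= k < size and 0 <= l < size:        # reject pairs leaving the image
            pairs.append((i, j, k, l))
    return np.array(pairs)

def code_image(img, pairs):
    """Concatenate the raw pixels with the products x[i,j] * x[k,l]."""
    products = img[pairs[:, 0], pairs[:, 1]] * img[pairs[:, 2], pairs[:, 3]]
    return np.concatenate([img.ravel(), products])

rng = np.random.default_rng(5)
pairs = choose_pixel_pairs(2000, 28, 5, rng)       # chosen once, in advance
img = rng.integers(0, 256, size=(28, 28)).astype(float)  # stand-in for one MNIST image
x = code_image(img, pairs)
print(x.shape)
```

The same `pairs` array is reused for every image, so the 2000 extra coordinates have a fixed meaning across the dataset.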

The true matching weight matrix $\bar W$ is specified as follows. The cross-domain matching weight $\bar W^{(12)} \in \mathbb{R}^{n_1\times n_2}$ between domain-1 and domain-2 is the 1-of-$K$ coding; $\bar w_{ij}^{(12)} = 1$ if the $i$-th image has the $j$-th label, and $\bar w_{ij}^{(12)} = 0$ otherwise. $\bar W^{(13)} \in \mathbb{R}^{n_1\times n_3}$ is defined similarly, but images may have two attribute labels; for example, an image of 3 has labels odd and prime. We set all elements of $\bar W^{(23)}$ as zeros, pretending ignorance about the number properties. We then prepared $W$ by randomly sampling elements from $\bar W$ with $\epsilon = 0.2$. Only the upper triangular parts of the weight matrices are stored in memory, so that the symmetry $W = W^T$ automatically holds. The number of nonzero elements in the upper triangular part of the matrix is 143775 for $\bar W$, and it becomes 28779 for $W$. In particular, the number of nonzero elements in $W^{(12)}$ is 12057.

The optimal $A$ is computed by the method of Section 3.3. The regularization matrix is block diagonal $L_M = {\rm Diag}(L_M^{(1)}, L_M^{(2)}, L_M^{(3)})$ with $L_M^{(d)} = \alpha_d I_{p_d}$ and $\alpha_d = {\rm tr}((X^{(d)})^TM^{(d)}X^{(d)})/p_d$. The regularization parameters are $\gamma_M > 0$ and $\gamma_W = 0$. The computation with $\gamma_M = 0$ actually uses a small value $\gamma_M = 10^{-6}$. The number of positive eigenvalues is $K^+ = 11$. The appropriate value $K = 9$ was chosen by looking at the distribution of eigenvalues $\lambda_k$ and $\phi_k^{\rm cv}$.

The three types of matching errors of Section 4 are computed as follows. The plotted values are not the component-wise matching errors $\phi_k$, but the sum $\phi = \sum_{k=1}^K\phi_k$. The true matching error $\phi_k^{\rm true}(W, \bar W)$ is computed with $\bar W$ of the test dataset here, while $A$ is computed from the training dataset. This is different from the definition in Section 4.1 but more appropriate if test datasets are available. The fitting error $\phi_k^{\rm fit}(W)$ and the cross-validation error $\phi_k^{\rm cv}(W)$ are computed from the training dataset. In particular, $\phi_k^{\rm cv}(W)$ is computed by resampling elements from $W$ with $\kappa = 0.1$ so that the number of nonzero elements in the upper triangular part of $W^*$ is about 3000.

B.2. Simulation datasets 

We generated simulation datasets of cross-domain matching with $D = 3$, $p_1 = 10$, $p_2 = 30$, $p_3 = 100$, $n_1 = 125$, $n_2 = 250$, $n_3 = 500$. The two true matching weights $\bar W_A$ and $\bar W_B$ were created at first, and they were unchanged during the experiments. In each of the twelve experiments, $X$ and $\bar W$ are generated from either of $\bar W_A$ and $\bar W_B$, and then $W$ is generated independently 160 times, while $X$ is fixed, for taking the simulation average. Computation of the optimal $A$ is the same as Appendix B.1 using the same regularization term specified there.

The data matrix $X$ is specified as follows. First, 25 points on a $5\times 5$ grid in $\mathbb{R}^2$ are placed as $(1,1), (1,2), (1,3), (1,4), (1,5), (2,1), \ldots, (5,5)$. They are $(x_1^{(0)})^T, \ldots, (x_{25}^{(0)})^T$, where $d = 0$ is treated as a special domain for data generation. Matrices $B^{(d)} \in \mathbb{R}^{p_d\times 2}$, $d = 1,2,3$, are prepared with all elements distributed as $N(0,1)$ independently. Let $n_{d,i}$ be the number of vectors in domain-$d$ generated from the $i$-th grid point for $d = 1,2,3$, $i = 1,\ldots,25$, which will be specified later with constraints $\sum_{i=1}^{25}n_{d,i} = n_d$. Then data vectors $\{x_i^{(d)}, i = 1,\ldots,n_d\}$ are generated as $x_{i,j}^{(d)} = B^{(d)}x_i^{(0)} + \epsilon_{i,j}^{(d)}$, $i = 1,\ldots,25$, $j = 1,\ldots,n_{d,i}$, with elements of $\epsilon_{i,j}^{(d)}$ distributed as $N(0, 0.5^2)$ independently. Each column of $X^{(d)}$ is then standardized to mean zero and variance one.

The true matching weight matrix $\bar W$ is specified as follows. Two data vectors in different domains $d$ and $e$ are linked to each other as $\bar w_{ij}^{(de)} = 1$ if they are generated from the same grid point. All other elements in $\bar W$ are zero. Two types of $\bar W$, denoted as $\bar W_A$ and $\bar W_B$, are considered. For $\bar W_A$, the numbers of data vectors are the same for all grid points; $n_{1,i} = 5$, $n_{2,i} = 10$, $n_{3,i} = 20$, $i = 1,\ldots,25$. For $\bar W_B$, the numbers of data vectors are randomly generated from the power-law with probability proportional to $n^{-3}$. The largest numbers are $n_{1,17} = 26$, $n_{2,9} = 49$, $n_{3,23} = 349$. The number of nonzero elements in the upper triangular part of the matrix is 8750 for $\bar W_A$ (1250, 2500, 5000, respectively, for $\bar W_A^{(12)}$, $\bar W_A^{(13)}$, $\bar W_A^{(23)}$), and it is 6659 for $\bar W_B$ (906, 1096, 4657, respectively, for $\bar W_B^{(12)}$, $\bar W_B^{(13)}$, $\bar W_B^{(23)}$).
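
The link counts quoted for $\bar W_A$ follow directly from the numbers of vectors per grid point, since two vectors of different domains are linked exactly when they share a grid point. The short sketch below reproduces them; the helper name `link_counts` is an assumption made for illustration.

```python
import numpy as np

def link_counts(counts):
    """counts[d][i] = number of vectors of domain d+1 at grid point i.

    Returns the number of cross-domain links for every pair of domains (d, e), d < e.
    """
    c = np.asarray(counts)
    D = c.shape[0]
    return {(d + 1, e + 1): int(np.sum(c[d] * c[e]))
            for d in range(D) for e in range(d + 1, D)}

counts_A = [[5] * 25, [10] * 25, [20] * 25]   # n_{1,i}, n_{2,i}, n_{3,i} for W_A
links_A = link_counts(counts_A)
total_A = sum(links_A.values())
print(links_A, total_A)   # {(1, 2): 1250, (1, 3): 2500, (2, 3): 5000} 8750
```

The same function applies to $\bar W_B$ once the randomly generated counts $n_{d,i}$ are given; those counts are not listed in the text, so they are not reproduced here.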

For generating $W$ from $\bar W$, two sampling schemes, namely, the link sampling and the node sampling, are considered with three parameter settings for each. For the link sampling, $w_{ij}$ are sampled independently with probability $\epsilon = 0.02, 0.04, 0.08$. For the node sampling, vectors are sampled independently with probability $\xi = \sqrt{0.02} \approx 0.14$, $\sqrt{0.04} = 0.2$, $\sqrt{0.08} \approx 0.28$. Then $w_{ij}$ is sampled when both vectors $x_i$ and $x_j$ are sampled simultaneously.
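
The two schemes can be compared on a matrix with the structure of $\bar W_A$. The following is a minimal sketch: the function names and the random seed are assumptions, and the choice $\xi = \sqrt{\epsilon}$ simply reflects that a link between two distinct vectors survives node sampling with probability $\xi^2$.

```python
import numpy as np

def link_sampling(W_true, eps, rng):
    """Keep each link independently with probability eps (one draw per unordered pair)."""
    n = W_true.shape[0]
    z = np.triu(rng.random((n, n)) < eps)
    z = z | z.T
    return W_true * z

def node_sampling(W_true, xi, rng):
    """Keep each vector with probability xi; a link survives if both ends are kept."""
    keep = rng.random(W_true.shape[0]) < xi
    return W_true * np.outer(keep, keep)

# Structure of W_A: 25 grid points with 5, 10, and 20 vectors per point in the three domains.
grid = np.concatenate([np.repeat(np.arange(25), c) for c in (5, 10, 20)])
domain = np.concatenate([np.full(25 * c, d) for d, c in enumerate((5, 10, 20))])
W_true = ((grid[:, None] == grid[None, :]) &
          (domain[:, None] != domain[None, :])).astype(float)

rng = np.random.default_rng(7)
eps = 0.08
W_link = link_sampling(W_true, eps, rng)
W_node = node_sampling(W_true, np.sqrt(eps), rng)

n_true = int(np.triu(W_true).sum())
n_link = int(np.triu(W_link).sum())
n_node = int(np.triu(W_node).sum())
print(n_true, n_link, n_node)
```

Both schemes keep about $\epsilon \times 8750$ links on average, but under node sampling the retained links are concentrated on the sampled vectors, so the count fluctuates more between runs.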

Appendix C: Technical details 
C.1. Proof of Lemma 1 

The following argument on small change in eigenvalues and eigenvectors is an adaptation of Van Der Aa, Ter Morsche and Mattheij (2007) to our setting. The two equations in (15) are $(I_P + C)^T\hat A^T(\hat G + \Delta G)\hat A(I_P + C) - I_P = 0$ and $(I_P + C)^T\hat A^T(\hat H + \Delta H)\hat A(I_P + C) - \hat\Lambda - \Delta\Lambda = 0$. They are expanded as

$$(C^T + C + g) + (C^TC + C^Tg + gC) = O(\gamma^3P^2), \tag{33}$$

$$(C^T\hat\Lambda + \hat\Lambda C + h - \Delta\Lambda) + (C^T\hat\Lambda C + C^Th + hC) = O(\gamma^3P^2), \tag{34}$$

where the first part is $O(\gamma)$ and the second part is $O(\gamma^2P)$ on the left hand side of each equation.

First, we solve the $O(\gamma)$ parts of (33) and (34) by ignoring $O(\gamma^2P)$ terms. The $O(\gamma)$ part in (33) is $C^T + C + g = O(\gamma^2P)$, and we get (19) by looking at the $(i,i)$ elements $c_{ii} + c_{ii} + g_{ii} = O(\gamma^2P)$. Then substituting $C^T = -C - g + O(\gamma^2P)$ into (34), we have

$$\hat\Lambda C - C\hat\Lambda - g\hat\Lambda + h - \Delta\Lambda = O(\gamma^2P). \tag{35}$$

Looking at $(i,j)$ elements $\hat\lambda_ic_{ij} - c_{ij}\hat\lambda_j - g_{ij}\hat\lambda_j + h_{ij} = O(\gamma^2P)$ in (35) for $i \neq j$, and noticing $(\hat\lambda_i - \hat\lambda_j)^{-1} = O(1)$ from (A5), we get (21). We also have (17) by looking at $(i,i)$ elements $\hat\lambda_ic_{ii} - c_{ii}\hat\lambda_i - g_{ii}\hat\lambda_i + h_{ii} - \Delta\lambda_i = O(\gamma^2P)$ in (35).

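Next, we solve the $O(\gamma^2P)$ parts of (33) and (34) by ignoring $O(\gamma^3P^2)$ terms. For extracting the $O(\gamma^2P)$ parts from the equations, we simply replace $C$ with $\delta C$, $\Delta\Lambda$ with $\delta\Lambda$, $g$ with 0, and $h$ with 0 in the $O(\gamma)$ parts. By substituting $C^T + C + g = \delta C^T + \delta C + O(\gamma^3P^2)$ into (33), we get

$$\delta C^T + \delta C + C^TC + C^Tg + gC = O(\gamma^3P^2), \tag{36}$$

and the $(i,i)$ elements give

$$\delta c_{ii} = -\tfrac{1}{2}(C^TC)_{ii} - (gC)_{ii} + O(\gamma^3P^2). \tag{37}$$

By substituting $C^T\hat\Lambda + \hat\Lambda C + h - \Delta\Lambda = \delta C^T\hat\Lambda + \hat\Lambda\,\delta C - \delta\Lambda + O(\gamma^3P^2)$ into (34), we get

$$\delta C^T\hat\Lambda + \hat\Lambda\,\delta C - \delta\Lambda + C^T\hat\Lambda C + C^Th + hC = O(\gamma^3P^2). \tag{38}$$

Rewriting (36) as $\delta C^T = -(\delta C + C^TC + C^Tg + gC) + O(\gamma^3P^2)$, we substitute it into (38) to have

$$-\delta C\,\hat\Lambda + \hat\Lambda\,\delta C - gC\hat\Lambda + hC - \delta\Lambda + C^T\Delta\Lambda = O(\gamma^3P^2), \tag{39}$$

where (35) is used for simplifying the expression. Then we get

$$\delta c_{ij} = (\hat\lambda_i - \hat\lambda_j)^{-1}\bigl((gC)_{ij}\hat\lambda_j - (hC)_{ij} - c_{ji}\Delta\lambda_j\bigr) + O(\gamma^3P^2) \tag{40}$$

by looking at $(i,j)$ elements ($i \neq j$) of (39). Also we get

$$\delta\lambda_i = -(gC)_{ii}\hat\lambda_i + (hC)_{ii} + c_{ii}\Delta\lambda_i + O(\gamma^3P^2) \tag{41}$$

by looking at $(i,i)$ elements of (39).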

Finally, the remaining terms in (37), (40), (41) will be written by using (17), (19), (21). For any $(i,j)$ with $i \le P$, $j \le J$, $(gC)_{ij} = -\tfrac{1}{2}g_{ij}g_{jj} + \sum_{k\neq j}(\hat\lambda_k - \hat\lambda_j)^{-1}(g_{kj}\hat\lambda_j - h_{kj})g_{ki} + O(\gamma^3P^2)$ and $(hC)_{ij} = -\tfrac{1}{2}h_{ij}g_{jj} + \sum_{k\neq j}(\hat\lambda_k - \hat\lambda_j)^{-1}(g_{kj}\hat\lambda_j - h_{kj})h_{ki} + O(\gamma^3P^2)$. For $(i,j)$ with $i \neq j$ and one of $i$ and $j$ being not greater than $J$, $c_{ji}\Delta\lambda_j = -(\hat\lambda_j - \hat\lambda_i)^{-1}(g_{ji}\hat\lambda_i - h_{ji})(g_{jj}\hat\lambda_j - h_{jj}) + O(\gamma^3P^2)$. For $i \le J$, $(C^TC)_{ii} = \tfrac{1}{4}(g_{ii})^2 + \sum_{j\neq i}(\hat\lambda_j - \hat\lambda_i)^{-2}(g_{ji}\hat\lambda_i - h_{ji})^2 + O(\gamma^3P^2)$. For $i \le P$, $c_{ii}\Delta\lambda_i = \tfrac{1}{2}g_{ii}(g_{ii}\hat\lambda_i - h_{ii}) + O(\gamma^3P^2)$. Using these expressions, we get (18), (20), and (22).

C.2. Proof of Lemma 2 

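Let us denote $C = (c^1, \ldots, c^P)$ and $I_P = (\delta^1, \ldots, \delta^P)$, where the elements are $c^k = (c_{1k}, \ldots, c_{Pk})^T$ and $\delta^k = (\delta_{1k}, \ldots, \delta_{Pk})^T$. Then $a^k = \hat A(\delta^k + c^k)$ and $y^k = b_kXa^k = b_kX\hat A(\delta^k + c^k)$. Noticing $\phi_k(W, \tilde W) = \tfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\tilde w_{ij}(y_{ik} - y_{jk})^2 = (y^k)^T(\tilde M - \tilde W)y^k$, and substituting $y^k = b_kX\hat A(\delta^k + c^k)$ into it, we have $\phi_k(W, \tilde W) = b_k^2(\delta^k + c^k)^T\hat S(\delta^k + c^k) = b_k^2(\hat s_{kk} + 2\sum_i\hat s_{ik}c_{ik} + \sum_i\sum_j\hat s_{ij}c_{ik}c_{jk})$. Similarly, we have $b_k^2 = ((a^k)^TX^TMXa^k)^{-1} = ((\delta^k + c^k)^T(\delta^k + c^k))^{-1} = (1 + 2c_{kk} + (C^TC)_{kk})^{-1} = 1 - 2c_{kk} - (C^TC)_{kk} + 4(c_{kk})^2 + O(\gamma^3P)$. Substituting it into $\phi_k(W, \tilde W)$, we have

$$\begin{aligned}
\phi_k(W, \tilde W) = {} & \hat s_{kk} + 2\sum_i\hat s_{ik}c_{ik} + \sum_i\sum_j\hat s_{ij}c_{ik}c_{jk} \\
& - 2c_{kk}\hat s_{kk} - 4c_{kk}\sum_i\hat s_{ik}c_{ik} - (C^TC)_{kk}\hat s_{kk} + 4(c_{kk})^2\hat s_{kk} + O(\gamma^3P^2) \\
= {} & \hat s_{kk}\bigl(1 + (c_{kk})^2 - (C^TC)_{kk}\bigr) + 2\sum_{i\neq k}\hat s_{ik}c_{ik}(1 - c_{kk}) + \sum_{i\neq k}\sum_{j\neq k}\hat s_{ij}c_{ik}c_{jk} + O(\gamma^3P^2),
\end{aligned}$$

which gives (23) after rearranging the formula using the results of Lemma 1. In particular, the last term $\sum_{i\neq k}\sum_{j\neq k}\hat s_{ij}c_{ik}c_{jk} = \sum_{i\neq k}\sum_{j\neq k}\hat s_{ij}(\hat\lambda_i - \hat\lambda_k)^{-1}(\hat\lambda_j - \hat\lambda_k)^{-1}(g_{ik}\hat\lambda_k - h_{ik})(g_{jk}\hat\lambda_k - h_{jk}) + O(\gamma^3P^3)$ leads to the asymptotic error $O(\gamma^3P^3)$ of (23).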

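C.3. Proof of Lemma 3 

First note that $\Delta w_{lm} = w_{lm} - \epsilon\bar w_{lm} = (z_{lm} - \epsilon)\bar w_{lm}$ from (1), and so $E(\Delta w_{lm}) = 0$ and

$$E(\Delta w_{lm}\Delta w_{l'm'}) = \delta_{ll'}\delta_{mm'}\,\epsilon(1-\epsilon)\bar w_{lm}^2. \tag{42}$$

From (25) and the definition of ${\rm bias}_k$, we have

$$\begin{aligned}
{\rm bias}_k = E\Bigl[\sum_{l\ge m}\sum_{l'\ge m'}\Delta w_{lm}\Delta w_{l'm'}\Bigl\{ & -(\mathcal{G}[\bar Y]_{lm}^{kk} - \mathcal{H}[\bar Y]_{lm}^{kk})\,\mathcal{G}[\bar Y]_{l'm'}^{kk} \\
& + \sum_{j\neq k}2(\bar\lambda_j - \bar\lambda_k)^{-1}(\mathcal{G}[\bar Y]_{lm}^{jk} - \mathcal{H}[\bar Y]_{lm}^{jk})(\mathcal{G}[\bar Y]_{l'm'}^{jk}\bar\lambda_k - \mathcal{H}[\bar Y]_{l'm'}^{jk})\Bigr\}\Bigr],
\end{aligned}$$
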
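and thus we get (28) by (42). Both $\hat g_{ij}$ and $\hat h_{ij}$ are of the form $\sum_{l\ge m}f_{lm}\Delta w_{lm}$ with $f_{lm} = O(N^{-1})$ in (25), where the number of nonzero terms is $O(\epsilon^{-1}N)$ in the summation $\sum_{l\ge m}$. It then follows from (42) that $V(\sum_{l\ge m}f_{lm}\Delta w_{lm}) = \sum_{l\ge m}f_{lm}^2V(\Delta w_{lm}) = \epsilon(1-\epsilon)\sum_{l\ge m}f_{lm}^2\bar w_{lm}^2 = O(\epsilon N^{-2}\epsilon^{-1}N) = O(N^{-1})$. Therefore $\sum_{l\ge m}f_{lm}\Delta w_{lm} = O(N^{-1/2})$, showing $\hat g = O(N^{-1/2})$ and $\hat h = O(N^{-1/2})$.

In order to show (27), we prepare $\hat C = (\hat c_{ij})$ and $\Delta\hat\Lambda = {\rm diag}(\Delta\hat\lambda_1, \ldots, \Delta\hat\lambda_P)$ with

$$\hat A = \bar A(I_P + \hat C), \quad \hat\Lambda = \bar\Lambda + \Delta\hat\Lambda$$

for describing change from $\bar A$ to $\hat A$. The elements are given by Lemma 1 with $\gamma = N^{-1/2}$. In particular (17), (19) and (21) become

$$\begin{aligned}
\Delta\hat\lambda_i &= -(\hat g_{ii}\bar\lambda_i - \hat h_{ii}) + O(N^{-1}P), \\
\hat c_{ii} &= -\tfrac{1}{2}\hat g_{ii} + O(N^{-1}P), \\
\hat c_{ij} &= (\bar\lambda_i - \bar\lambda_j)^{-1}(\hat g_{ij}\bar\lambda_j - \hat h_{ij}) + O(N^{-1}P).
\end{aligned} \tag{43}$$

Note that the roles of $W$, $g$ and $h$ in Lemma 1 are now played by $\epsilon\bar W$, $\hat g$ and $\hat h$, respectively, and therefore the expressions of $C$ and $\Delta\Lambda$ in Lemma 1 give those of $\hat C$ and $\Delta\hat\Lambda$ above.

Let us define

$$\Delta\hat S = \hat A^TX^T(\Delta M - \Delta W)X\hat A.$$

Then the difference of the fitting error from the true error, namely (26), is expressed asymptotically by (23) of Lemma 2 with $\hat S = \Delta\hat S$. Substituting $\hat A = \bar A(I_P + \hat C)$ into $\Delta\hat S$, we get

$$\begin{aligned}
\Delta\hat S &= (I_P + \hat C)^T(\hat g - \hat h)(I_P + \hat C) \\
&= \hat g - \hat h + (\hat g - \hat h)\hat C + \hat C^T(\hat g - \hat h) + O(N^{-3/2}P^2) \\
&= \hat g - \hat h + O(N^{-1}P),
\end{aligned}$$

where $\Delta\hat S = O(N^{-1/2})$ but $E(\Delta\hat S) = O(N^{-1}P)$, since $E(\hat g) = E(\hat h) = 0$.

We now attempt to rewrite terms in (23) using the relation $W = \epsilon\bar W + \Delta W$. Define $\bar g = O(\gamma)$, $\bar h = O(\gamma)$ by

$$\bar g = \bar A^T\Delta G\bar A, \quad \bar h = \bar A^T\Delta H\bar A.$$

Then $g = (I_P + \hat C)^T\bar g(I_P + \hat C) = \bar g + O(N^{-1/2}\gamma P)$ and $h = (I_P + \hat C)^T\bar h(I_P + \hat C) = \bar h + O(N^{-1/2}\gamma P)$. We also have $(\hat\lambda_i - \hat\lambda_k)^{-1} = (\bar\lambda_i - \bar\lambda_k)^{-1} + O(N^{-1/2})$, since $\hat\lambda_i = \bar\lambda_i + O(N^{-1/2})$. We thus have $g_{ik}\hat\lambda_k - h_{ik} = \bar g_{ik}\bar\lambda_k - \bar h_{ik} + O(N^{-1/2}\gamma P)$. Therefore, (23) with $\hat S = \Delta\hat S$ is rewritten as

$$\begin{aligned}
\phi_k(W, \Delta W) &= \Delta\hat s_{kk} + \sum_{i\neq k}2(\hat\lambda_i - \hat\lambda_k)^{-1}\Delta\hat s_{ik}(g_{ik}\hat\lambda_k - h_{ik}) + O(N^{-1/2}\gamma^2P^2 + \gamma^3P^3) \\
&= \Delta\hat s_{kk} + \sum_{i\neq k}2(\bar\lambda_i - \bar\lambda_k)^{-1}(\hat g_{ik} - \hat h_{ik})(\bar g_{ik}\bar\lambda_k - \bar h_{ik}) + O(N^{-1}\gamma P^2 + \gamma^3P^3).
\end{aligned}$$

By noting $E(\hat g_{ik} - \hat h_{ik}) = 0$, we get

$$E(\phi_k(W, \Delta W)) = E(\Delta\hat s_{kk}) + O(N^{-1}\gamma P^2 + \gamma^3P^3). \tag{44}$$

For calculating $E(\Delta\hat s_{kk})$, we substitute (43) into $\Delta\hat s_{kk} = \hat g_{kk} - \hat h_{kk} + 2((\hat g - \hat h)\hat C)_{kk} + O(N^{-3/2}P^2)$. Then we have

$$\Delta\hat s_{kk} = \hat g_{kk} - \hat h_{kk} - (\hat g_{kk} - \hat h_{kk})\hat g_{kk} + \sum_{j\neq k}2(\hat g_{jk} - \hat h_{jk})(\bar\lambda_j - \bar\lambda_k)^{-1}(\hat g_{jk}\bar\lambda_k - \hat h_{jk}) + O(N^{-3/2}P^2),$$

and therefore, by noting $E(\hat g_{kk} - \hat h_{kk}) = 0$, we obtain

$$E(\Delta\hat s_{kk}) = {\rm bias}_k + O(N^{-3/2}P^2).$$

Combining it with (26) and (44), and also noting $O(N^{-3/2}P^2 + N^{-1}\gamma P^2) = O(N^{-3/2}P^2)$, we finally get (27).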


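C.4. Proof of Lemma 4 

For deriving an asymptotic expansion of $\phi_k((1-\kappa)^{-1}(W - W^*), \kappa^{-1}W^*)$ using Lemma 2, we replace $W$ and $\tilde W$ in (23), respectively, by $(1-\kappa)^{-1}(W - W^*)$ and $\kappa^{-1}W^*$. We define $\hat G^*$, $\hat H^*$, $\hat A^*$, and $\hat\Lambda^*$; they correspond to those with hat but $W$ is replaced by $(1-\kappa)^{-1}(W - W^*)$. The key equations are $\hat G^* = (1-\kappa)^{-1}X^T(M - M^*)X$, $\hat H^* = (1-\kappa)^{-1}X^T(W - W^*)X$, $\hat A^{*T}\hat G^*\hat A^* = I_P$, and $\hat A^{*T}\hat H^*\hat A^* = \hat\Lambda^*$. The regularization terms are now represented by $g^* = \hat A^{*T}\Delta G\hat A^*$ and $h^* = \hat A^{*T}\Delta H\hat A^*$. For $\tilde W = \kappa^{-1}W^*$, we put

$$\hat S^* = \kappa^{-1}\hat A^{*T}X^T(M^* - W^*)X\hat A^*.$$

Then $g$, $h$, $\hat\Lambda$ and $\hat S$, respectively, in Lemma 2 are replaced by $g^*$, $h^*$, $\hat\Lambda^*$ and $\hat S^*$. Noticing $g^* = O(\gamma)$, $h^* = O(\gamma)$ and $\hat S^* = O(1)$, the asymptotic orders of the terms in Lemma 2 remain the same. (23) is now written as

$$\phi_k\bigl((1-\kappa)^{-1}(W - W^*), \kappa^{-1}W^*\bigr) = \hat s_{kk}^* + \sum_{i\neq k}2(\hat\lambda_i^* - \hat\lambda_k^*)^{-1}\hat s_{ik}^*(g_{ik}^*\hat\lambda_k^* - h_{ik}^*) + \cdots, \tag{45}$$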

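where terms are omitted to save space, but all the terms in (23) will be calculated below. In order to take $E^*(\cdot|W)$ of (45) later, we define

$$\Delta\hat G^* = \hat G^* - \hat G, \quad \Delta\hat H^* = \hat H^* - \hat H, \quad \hat g^* = \hat A^T\Delta\hat G^*\hat A, \quad \hat h^* = \hat A^T\Delta\hat H^*\hat A$$

for describing change from $\hat A$ to $\hat A^*$. Also define

$$\tilde S^* = \kappa^{-1}\hat A^TX^T(M^* - W^*)X\hat A$$

for $\tilde W = \kappa^{-1}W^*$. They are expressed in terms of

$$\Delta W^* = W^* - \kappa W, \quad \Delta M^* = M^* - \kappa M$$

as $\Delta\hat G^* = -(1-\kappa)^{-1}X^T\Delta M^*X$, $\Delta\hat H^* = -(1-\kappa)^{-1}X^T\Delta W^*X$, $\hat g^* = -(1-\kappa)^{-1}\hat Y^T\Delta M^*\hat Y$, $\hat h^* = -(1-\kappa)^{-1}\hat Y^T\Delta W^*\hat Y$, and $\tilde S^* = \kappa^{-1}\hat Y^T(\Delta M^* - \Delta W^*)\hat Y + I_P - \hat\Lambda$. The elements of $\hat g^*$, $\hat h^*$ and $\tilde S^*$ are expressed using the notation of Section 6.4 as

$$\begin{aligned}
\hat g_{ij}^* &= -(1-\kappa)^{-1}\sum_{l\ge m}\mathcal{G}[\hat Y]_{lm}^{ij}\,\Delta w_{lm}^*, \quad \hat h_{ij}^* = -(1-\kappa)^{-1}\sum_{l\ge m}\mathcal{H}[\hat Y]_{lm}^{ij}\,\Delta w_{lm}^*, \\
\tilde s_{ij}^* &= \kappa^{-1}\sum_{l\ge m}(\mathcal{G}[\hat Y]_{lm}^{ij} - \mathcal{H}[\hat Y]_{lm}^{ij})\,\Delta w_{lm}^* + \delta_{ij}(1 - \hat\lambda_i).
\end{aligned} \tag{46}$$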

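It follows from the argument below that $\hat g^* = O(N^{-1})$, $\hat h^* = O(N^{-1})$, and $\tilde S^* = O(1)$. Note that $\Delta w_{lm}^* = w_{lm}^* - \kappa w_{lm} = (z_{lm}^* - \kappa)w_{lm}$ from (10), and so $E^*(\Delta w_{lm}^*|W) = 0$ and

$$E^*(\Delta w_{lm}^*\Delta w_{l'm'}^*|W) = \delta_{ll'}\delta_{mm'}\,\kappa(1-\kappa)w_{lm}^2. \tag{47}$$

Both $\hat g_{ij}^*$ and $\hat h_{ij}^*$ are of the form $\sum_{l\ge m}f_{lm}\Delta w_{lm}^*$ with $f_{lm} = O(N^{-1})$, where the number of nonzero terms is $O(N)$ in the summation $\sum_{l\ge m}$. Thus $V^*(\sum_{l\ge m}f_{lm}\Delta w_{lm}^*|W) = \sum_{l\ge m}f_{lm}^2V^*(\Delta w_{lm}^*|W) = \kappa(1-\kappa)\sum_{l\ge m}f_{lm}^2w_{lm}^2 = O(\kappa N^{-2}N) = O(\kappa N^{-1}) = O(N^{-2})$. Therefore $\sum_{l\ge m}f_{lm}\Delta w_{lm}^* = O(N^{-1})$.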

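The change from $\hat A$ to $\hat A^*$ is expressed as

$$\hat A^* = \hat A(I_P + \hat C^*), \quad \hat\Lambda^* = \hat\Lambda + \Delta\hat\Lambda^*.$$

The elements of $\hat C^* = (\hat c_{ij}^*)$ and $\Delta\hat\Lambda^* = {\rm diag}(\Delta\hat\lambda_1^*, \ldots, \Delta\hat\lambda_P^*)$ are given by Lemma 1 with $\gamma = N^{-1}$. In particular (17), (19) and (21) become

$$\begin{aligned}
\Delta\hat\lambda_i^* &= -(\hat g_{ii}^*\hat\lambda_i - \hat h_{ii}^*) + O(N^{-2}P), \\
\hat c_{ii}^* &= -\tfrac{1}{2}\hat g_{ii}^* + O(N^{-2}P), \\
\hat c_{ij}^* &= (\hat\lambda_i - \hat\lambda_j)^{-1}(\hat g_{ij}^*\hat\lambda_j - \hat h_{ij}^*) + O(N^{-2}P).
\end{aligned} \tag{48}$$

Note that the roles of $g$ and $h$ in Lemma 1 are now played by $\hat g^*$ and $\hat h^*$, and therefore the expressions of $C$ and $\Delta\Lambda$ in Lemma 1 give those of $\hat C^*$ and $\Delta\hat\Lambda^*$.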

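Using the above results, $\hat\Lambda^*$, $g^*$, $h^*$ and $\hat S^*$ in (45) are expressed as follows. $\hat\lambda_k^* = \hat\lambda_k + \Delta\hat\lambda_k^* = \hat\lambda_k + O(N^{-1})$. $g^* = (I_P + \hat C^*)^Tg(I_P + \hat C^*) = g + g\hat C^* + \hat C^{*T}g + \hat C^{*T}g\hat C^* = g + O(N^{-1}\gamma P)$, and similarly $h^* = h + O(N^{-1}\gamma P)$. $\hat S^* = (I_P + \hat C^*)^T\tilde S^*(I_P + \hat C^*) = \tilde S^* + \tilde S^*\hat C^* + \hat C^{*T}\tilde S^* + \hat C^{*T}\tilde S^*\hat C^* = \tilde S^* + O(N^{-1}P)$. In particular the diagonal elements of $\hat S^*$ are

$$\begin{aligned}
\hat s_{kk}^* &= \tilde s_{kk}^* + 2\sum_{j\neq k}\tilde s_{jk}^*\hat c_{jk}^* + 2\tilde s_{kk}^*\hat c_{kk}^* + O(N^{-2}P^2) \\
&= \tilde s_{kk}^* + 2\sum_{j\neq k}(\hat\lambda_j - \hat\lambda_k)^{-1}\tilde s_{jk}^*(\hat g_{jk}^*\hat\lambda_k - \hat h_{jk}^*) - \tilde s_{kk}^*\hat g_{kk}^* + O(N^{-2}P^2).
\end{aligned} \tag{49}$$

Therefore, (45) is now expressed as

$$\begin{aligned}
\phi_k\bigl((1-\kappa)^{-1}(W - W^*), \kappa^{-1}W^*\bigr) = {} & \hat s_{kk}^* + \sum_{i\neq k}2(\hat\lambda_i - \hat\lambda_k)^{-1}\tilde s_{ik}^*(g_{ik}\hat\lambda_k - h_{ik}) \\
& - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-2}\bigl\{\tilde s_{kk}^*(g_{ik}\hat\lambda_k - h_{ik})^2 + 2\tilde s_{ik}^*(g_{ki}\hat\lambda_i - h_{ki})(g_{kk}\hat\lambda_k - h_{kk})\bigr\} \\
& + \sum_{i\neq k}\sum_{j\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(\hat\lambda_j - \hat\lambda_k)^{-1}\bigl\{2\tilde s_{ik}^*(g_{ij}\hat\lambda_k - h_{ij})(g_{jk}\hat\lambda_k - h_{jk}) \\
& \qquad + \tilde s_{ij}^*(g_{ik}\hat\lambda_k - h_{ik})(g_{jk}\hat\lambda_k - h_{jk})\bigr\} + O(N^{-1}\gamma P^2 + \gamma^3P^3).
\end{aligned}$$

We take $E^*(\cdot|W)$ of the above formula. Noting $E^*(\tilde s_{ij}^*|W) = \delta_{ij}(1 - \hat\lambda_i)$, we have

$$\phi_k^{\rm cv}(W) = E^*(\hat s_{kk}^*|W) - \sum_{i\neq k}(\hat\lambda_i - \hat\lambda_k)^{-1}(g_{ik}\hat\lambda_k - h_{ik})^2 + O(N^{-1}\gamma P^2 + \gamma^3P^3).$$

Comparing this with (24), and using (49), we get

$$\begin{aligned}
\phi_k^{\rm fit}(W) - \phi_k^{\rm cv}(W) &= 1 - \hat\lambda_k - E^*(\hat s_{kk}^*|W) + O(N^{-1}\gamma P^2 + \gamma^3P^3) \\
&= E^*(\tilde s_{kk}^*\hat g_{kk}^*|W) - 2\sum_{j\neq k}(\hat\lambda_j - \hat\lambda_k)^{-1}E^*\bigl(\tilde s_{jk}^*(\hat g_{jk}^*\hat\lambda_k - \hat h_{jk}^*)\big|W\bigr) \\
&\quad + O(N^{-2}P^2 + N^{-1}\gamma P^2 + \gamma^3P^3).
\end{aligned}$$

Finally, this gives (29) by using (46) and (47).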

For taking $E(\cdot)$ of (29), we now attempt to rewrite the terms. Noticing $\hat Y = \bar Y (I_P + \hat C) = \bar Y + O(N^{-1} P) = O(N^{-1/2})$, we have $\hat y_{ik} \hat y_{jl} = \bar y_{ik} \bar y_{jl} + O(N^{-3/2} P) = O(N^{-1})$. Also we have
A. = \ + 0(N-n>). (29) is now 4>t(W) -<t>?(W) = E,>„ <{-(<?[?]“- «[V],“)5[V]S + 
Ej** 2(Aj - A t )-'(e[F]?‘ - H[?£)(S[?]i% - «[?£)} +0(Af-’/ 2 P 2 +yp 3 ). Comparing 
this with (28), we obtain (30) by using $E(w_{lm}^2) = \epsilon \bar w_{lm}^2$.
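Table 1

Notations of mathematical symbols

| Symbol | Description | Section | Equation |
|---|---|---|---|
| $N$ | the number of data vectors | 1 | |
| $P$ | the dimensions of data vector | 1 | |
| $K$ | the dimensions of transformed vector | 1 | |
| $x_i$, $X$ | data vector, data matrix | 1 | |
| $y_i$, $Y$ | transformed vector, transformed matrix | 1 | |
| $A$ | linear transformation matrix | 1 | |
| $w_{ij}$, $W$ | matching weights, matching weight matrix | 1 | |
| $\bar w_{ij}$, $\bar W$ | true matching weights, true matching weight matrix | 1 | (1) |
| $\epsilon$ | link sampling probability | 1 | (1) |
| $m_i$, $M$ | row sums of matching weights | 3.1 | |
| $y^k$ | $k$-th component of transformed matrix | 3.1 | |
| $\phi_k$ | matching error of $k$-th component | 3.1, 4.1 | |
| $\hat A$, $\hat Y$, $\hat G$, $\hat H$ | matrices for the optimization without regularization | 3.2, 6.3 | |
| $\Delta G$, $\Delta H$ | regularization matrices $\Delta G = \gamma_M L_M$, $\Delta H = \gamma_W L_W$ | 3.3 | |
| $G$, $H$ | matrices for the optimization with regularization | 3.3 | |
| $u_k$, $\lambda_k$ | eigenvector, eigenvalue | 3.3 | |
| $K^+$ | the number of positive eigenvalues | 3.3 | |
| $b_k$ | rescaling factor of $k$-th component | 3.3 | |
| $w^*_{ij}$, $W^*$ | matching weights for cross-validation | 4.2 | (10) |
| $\kappa$ | link resampling probability | 4.2 | (10) |
| $\xi$ | node sampling probability | 4.3 | (11) |
| $\nu$ | node resampling probability | 4.3 | (12) |
| $\gamma$ | regularization parameter | 6.1 | (A4) |
| $J$ | maximum value of $k$ for evaluating $\phi_k$ | 6.1, 6.2 | (A5) |
| $\hat\lambda_i$, $\hat\Lambda$ | eigenvalues without regularization | 6.3 | |
| $g$, $h$ | matrices representing regularization | 6.3 | |
| $\Delta\lambda_i$, $\Delta\Lambda$ | changes in eigenvalues due to regularization | 6.3 | |
| $c_{ij}$, $C$ | changes in $\hat A$ due to regularization | 6.3 | |
| $\delta\lambda_i$, $\delta c_{ij}$ | higher order terms of $\Delta\lambda_i$, $c_{ij}$ | 6.3 | |
| $\hat s_{ij}$, $\hat S$ | defined from $W$ and $\hat Y$ | 6.3 | |
| $\bar A$, $\bar Y$, $\bar G$, $\bar H$ | matrices for the optimization with respect to $\epsilon \bar W$ | 6.4 | |
| $\bar\lambda_i$, $\bar\Lambda$ | eigenvalues with respect to $\epsilon \bar W$ | 6.4 | |
| $\Delta\hat w_{ij}$, $\Delta\hat W$, $\Delta\hat M$ | $\Delta\hat W = W - \epsilon \bar W$, $\Delta\hat M = M - \epsilon \bar M$ | 6.4 | |
| $\Delta\hat G$, $\Delta\hat H$, $\hat g$, $\hat h$ | matrices representing the change $\Delta\hat W$ | 6.4 | |
| $\mathcal{G}^{lm}_{ij}$, $\mathcal{H}^{lm}_{ij}$ | coefficients for representing $\hat g_{ij}$, $\hat h_{ij}$ | 6.4 | (25) |
| $\hat c_{ij}$, $\hat C$, $\Delta\hat\lambda_i$, $\Delta\hat\Lambda$ | for representing change from $\bar A$ to $\hat A$ | C.3 | |
| $\Delta\hat s_{ij}$, $\Delta\hat S$ | defined from $\Delta\hat W$ and $\bar Y$ | C.3 | |
| $\bar g$, $\bar h$ | representing regularization with respect to $\epsilon \bar W$ | C.3 | |
| $G^*$, $H^*$, $A^*$, $\Lambda^*$, $g^*$, $h^*$ | $\hat G$, $\hat H$, $\hat A$, $\hat\Lambda$, $g$, $h$ for training dataset in cross-validation | C.4 | |
| $s^*_{ij}$, $S^*$, $\hat s^*_{ij}$, $\hat S^*$ | for test dataset in cross-validation | C.4 | |
| $\Delta\hat w^*_{ij}$, $\Delta\hat W^*$, $\Delta\hat M^*$ | $\Delta\hat W^* = W^* - \kappa W$, $\Delta\hat M^* = M^* - \kappa M$ | C.4 | |
| $\Delta\hat G^*$, $\Delta\hat H^*$, $\hat g^*$, $\hat h^*$ | matrices representing the change $\Delta\hat W^*$ | C.4 | |
| $c^*_{ij}$, $C^*$, $\Delta\lambda^*_i$, $\Delta\Lambda^*$ | for representing change from $\hat A$ to $A^*$ | C.4 | (48) |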
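Table 2

Parameters of data generation for experiments

| Exp. | Sampling | $\bar W$ | $\epsilon$, $\xi$ |
|---|---|---|---|
| 1 | link | $\bar W_A$ | 0.02 |
| 2 | link | $\bar W_A$ | 0.04 |
| 3 | link | $\bar W_A$ | 0.08 |
| 4 | link | $\bar W_B$ | 0.02 |
| 5 | link | $\bar W_B$ | 0.04 |
| 6 | link | $\bar W_B$ | 0.08 |
| 7 | node | $\bar W_A$ | 0.02 |
| 8 | node | $\bar W_A$ | 0.04 |
| 9 | node | $\bar W_A$ | 0.08 |
| 10 | node | $\bar W_B$ | 0.02 |
| 11 | node | $\bar W_B$ | 0.04 |
| 12 | node | $\bar W_B$ | 0.08 |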
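The two sampling schemes named in Table 2 can be sketched as follows. This is an illustration under assumptions, not the paper's code: link sampling keeps each link independently with probability $\epsilon$, and node sampling is assumed here to keep each node with probability $\xi$ and retain a link only when both of its end nodes are kept. The function names and the ring-shaped true weight matrix are hypothetical.

```python
import numpy as np

def link_sampling(W_true, eps, rng):
    """Keep each link (i, j), i <= j, independently with probability eps."""
    N = W_true.shape[0]
    keep = np.triu(rng.random((N, N)) < eps)   # one decision per unordered pair
    keep = keep | keep.T                       # mirror so that W stays symmetric
    return np.where(keep, W_true, 0.0)

def node_sampling(W_true, xi, rng):
    """Keep each node with probability xi; a link survives only if both ends survive."""
    N = W_true.shape[0]
    z = rng.random(N) < xi
    return np.where(np.outer(z, z), W_true, 0.0)

rng = np.random.default_rng(0)
N = 200
idx = np.arange(N)
W_true = np.zeros((N, N))
W_true[idx, (idx + 1) % N] = 1.0               # hypothetical ring structure
W_true = np.maximum(W_true, W_true.T)

W_link = link_sampling(W_true, 0.08, rng)
W_node = node_sampling(W_true, 0.08, rng)
print(int(W_link.sum() / 2), int(W_node.sum() / 2))   # surviving links under each scheme
```

Under these assumptions a link survives node sampling with probability $\xi^2$, so for the same parameter value node sampling leaves far fewer links than link sampling.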



[Figure 1: scatter plot; horizontal axis PC1, vertical axis PC2]

Fig 1. Scatter plot of the first two components of CDMCA for the MNIST handwritten digits. Shown are 300 digit images randomly sampled from the 60000 training images (d = 1). Digit labels (d = 2) and attribute labels (d = 3) are also shown in the same “common space” of K = 2.
[Figure 2: two panels plotted against gamma values: (a) classification errors (K = 9), (b) matching errors (K = 9)]

Fig 2. CDMCA results for $\gamma_M = 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}$. (a) Classification error of predicting digit labels (d = 2) from the 10000 test images, and that for predicting attribute labels (d = 3). (b) The true matching error of 10000 test images, and the fitting error and the cross-validation error computed from the 60000 training images.



[Figure 3: three matrix images, dimensions 875 x 875: (a) $\bar W_A$, (b) $W$ of Experiment 1, (c) $W$ of Experiment 9]

Fig 3. Regular matching weights. (a) A true matching weight of a regular structure. (b) Link sampling from $\bar W_A$. (c) Node sampling from $\bar W_A$.
[Figure 4: matrix images; recovered panel titles: (b) $W$ of Experiment 5, (c) $W$ of Experiment 11]

Fig 4. Power-law matching weights. (a) A true matching weight of a power-law structure. (b) Link sampling from $\bar W_B$. (c) Node sampling from $\bar W_B$.



[Figure 5: three panels: (a) $\gamma_M = 0$, (b) $\gamma_M = 0.1$, (c) matching errors (weighted); legend entries: Domain 2, Domain 3]

Fig 5. Experiment 1. (a) and (b) Scatter plots of the first two components of CDMCA with $\gamma_M = 0$ and $\gamma_M = 0.1$, respectively. (c) True matching error of $\bar W_A$, and the other matching errors computed from $W$.




[Figure 6: three panels: (a) $\gamma_M = 0$, (b) $\gamma_M = 0.1$, (c) matching errors (unweighted)]

Fig 6. Experiment 5. (a) and (b) Scatter plots of the first two components of CDMCA with $\gamma_M = 0$ and $\gamma_M = 0.1$, respectively. (c) True matching error of $\bar W_B$, and the other matching errors computed from $W$.
[Figure 7: three panels plotted against experiments: (a) fitting error, (b) cv error (link), (c) cv error (node)]

Fig 7. Bias of the matching errors of the components rescaled by the weighted variance (2).



[Figure 8: three panels: (a) fitting error, (b) cv error (link), (c) cv error (node)]

Fig 8. Bias of the matching errors of the components rescaled by the unweighted variance (9).


























































